The Project Proposal covers the following sections:
What is the relationship between happiness indicators and life expectancy from a global perspective?
There is an ongoing debate in the Malaysian community about whether money can buy happiness. While most people agree that, from a sentimental point of view, money cannot buy a happy life, it is undeniable that people can expect a higher quality of life with money. Moreover, a higher quality of life generally indicates greater happiness linked to improved health, leading to higher life expectancy (Lozano and Sole-Auro, 2021). This motivates the Wisteria Team to explore how happiness determinants correlate with life expectancy.
The Wisteria Team will explore how the happiness indicators (independent variables) provided in the World Happiness Report 2022 correlate with life expectancy (dependent variable). The happiness indicators are the scores used to calculate happiness rankings across countries, whereas life expectancy is defined as "the number of years a person can expect to live" (Murillo, 2016).
The World Happiness Report provides data for both happiness indicators and life expectancy across different countries.
| No. | Happiness Indicator | Meaning |
|---|---|---|
| 1 | Log GDP per capita | Measurement of the economic output of a nation per person (Investopedia). |
| 2 | Social support | National average of the binary responses (1 = YES, 0 = NO) to the GWP ("Gallup World Poll") question: "If you were in trouble, do you have relatives or friends you can count on to help you whenever you need them, or not?" |
| 3 | Healthy life expectancy at birth | An estimate of the average number of years babies born this year would live in a state of good general health, if mortality levels and good-health levels at each age remained constant in the future (Government of the UK). |
| 4 | Freedom to make life choices | National average of the binary responses (1 = YES, 0 = NO) to the GWP question: "Are you satisfied or dissatisfied with your freedom to choose what you do with your life?" |
| 5 | Generosity | Residual of regressing the national average of GWP responses to the donation question "Have you donated money to a charity in the past month?" on log GDP per capita (see the code sketch below this table). |
| 6 | Perceptions of corruption | National average of the binary responses (1 = YES, 0 = NO) to the two GWP questions: "Is corruption widespread throughout the government in this country or not?" and "Is corruption widespread within businesses in this country or not?" |
| 7 | Positive affect | Average of previous-day affect measures for laughter, enjoyment, and doing or learning something interesting, collected through a series of affect questions. |
| 8 | Negative affect | Average of previous-day affect measures for worry, sadness, and anger, collected through a series of affect questions. |
| 9 | Life ladder | Happiness score determined by the national average of responses to life evaluation questions. |
| 10 | Age dependency ratio | Ratio of the dependent population to the working-age population, which indicates the level of financial stress. |
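To make the 'generosity' definition concrete, the following is a minimal sketch of a regression residual, using made-up numbers rather than the Report's data:

# Minimal sketch of the 'generosity' residual (hypothetical values, not WHR data)
import numpy as np

log_gdp = np.array([7.4, 9.1, 10.6, 11.2])          # hypothetical log GDP per capita
donation_rate = np.array([0.18, 0.25, 0.42, 0.55])  # hypothetical national donation averages

# Fit a least-squares line: donation_rate ~ a * log_gdp + b
a, b = np.polyfit(log_gdp, donation_rate, deg=1)

# Generosity = observed donation rate minus the rate predicted from income alone
generosity = donation_rate - (a * log_gdp + b)
print(generosity.round(3))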
From a data-informed point of view, the vast amount of information in the Report, together with the motivation mentioned earlier, inspired the Wisteria Team to conduct a small study on the relationship between happiness indicators and life expectancy.
Upon starting the data understanding stage, the Wisteria Team decided to scan through the Report and conduct further reading. The Report states that the variables (happiness indicators) were taken from the Gallup World Poll surveys from 2019 to 2021. In other words, these variables originate from the answers participants gave to a series of life evaluation questions in the survey. The responses were then translated into scores (categorical to continuous data), allowing the team to carry out a quantitative study.
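As a minimal illustration of this categorical-to-continuous translation (with hypothetical responses, not actual GWP data), a binary survey item becomes a continuous national score by averaging:

import pandas as pd

# Hypothetical binary responses (1 = YES, 0 = NO) to one GWP question
responses = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B"],
    "answer":  [1, 0, 1, 1, 1],
})

# National score = mean of the binary answers per country
print(responses.groupby("country")["answer"].mean())  # A: 0.667, B: 1.000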
From a sociological perspective, the Wisteria Team skimmed through a book read by millions worldwide, "The Top Five Regrets of the Dying: A Life Transformed by the Dearly Departing". In this book, the dying expressed their deepest regrets at the end of their lives. The team collected these regrets to better understand its research topic regarding the implicit link between happiness, regrets, and life expectancy. Furthermore, the team used this chance to spread awareness in the Malaysian community of the significance of being happy.
The top five regrets of the dying are:
1. I wish I'd had the courage to live a life true to myself, not the life others expected of me.
2. I wish I hadn't worked so hard.
3. I wish I'd had the courage to express my feelings.
4. I wish I had stayed in touch with my friends.
5. I wish I had let myself be happier.
Being happy is a choice.
Murillo, I. L. (2016). The life expectancy: What is it and why does it matter? CENIE.
First, import the pandas and sklearn libraries.
import pandas as pd
import sklearn as sk  # note: sklearn is not used in this preprocessing section
Load the CSV files from our file directory.
df_age_ratio = pd.read_csv('Age_dependency_ratio.csv', skiprows=4)  # skip the metadata header rows in the World Bank file
df_happiness_2021 = pd.read_csv('world-happiness-report-2021.csv')
df_happiness = pd.read_csv('world-happiness-report.csv')
First and foremost, df_happiness was our baseline dataframe. We were going to integrate df_age_ratio into the df_happiness dataframe.
The strategy was to pick the exact value in df_age_ratio based on country and year:
Row: country
Column: year
col_country_name = df_happiness["Country name"]
col_year = df_happiness["year"]
new_col = []
country_cannot_be_matched = []
for i in range(len(col_country_name)):
    try:
        # Look up the age dependency ratio for this country and year
        new_col.append(df_age_ratio[df_age_ratio["Country Name"].isin([col_country_name[i]])][str(col_year[i])].iloc[0])
    except (IndexError, KeyError):
        # No match found: record the country and append a "None" placeholder
        country_cannot_be_matched.append(col_country_name[i])
        new_col.append("None")
set(country_cannot_be_matched)
{'Congo (Brazzaville)',
'Congo (Kinshasa)',
'Egypt',
'Gambia',
'Hong Kong S.A.R. of China',
'Iran',
'Ivory Coast',
'Kyrgyzstan',
'Laos',
'North Cyprus',
'Palestinian Territories',
'Russia',
'Slovakia',
'Somaliland region',
'South Korea',
'Swaziland',
'Syria',
'Taiwan Province of China',
'Venezuela',
'Yemen'}
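For reference, a vectorised alternative to the row-by-row lookup would be to reshape df_age_ratio into long format and merge. This is only a sketch, and it assumes the World Bank file keeps its usual wide layout (a 'Country Name' column plus one column per year):

# Sketch: melt the wide age-ratio table and merge on country and year
age_long = df_age_ratio.melt(id_vars=["Country Name"], var_name="year", value_name="age_ratio")
age_long["year"] = pd.to_numeric(age_long["year"], errors="coerce")  # non-year columns become NaN
age_long = age_long.dropna(subset=["year"])
age_long["year"] = age_long["year"].astype(int)
merged = df_happiness.merge(age_long, left_on=["Country name", "year"],
                            right_on=["Country Name", "year"], how="left")

Unmatched countries would simply get NaN instead of the "None" placeholder used above.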
However, a few countries could not be matched. We looked into this and discerned three possible scenarios that might have caused the issue:
1. The two files use different official names for the same country (e.g., "Russia" vs "Russian Federation").
2. The two files format or abbreviate the same name differently (e.g., "Egypt" vs "Egypt, Arab Rep.").
3. The country does not appear in df_age_ratio at all.
For scenarios 1 and 2, we would create a dict to translate the names of the undetected countries into the names used in df_age_ratio. For scenario 3, we would insert a "None" value.
rematching_dict = {'Congo (Brazzaville)' : "Congo, Rep.",
'Congo (Kinshasa)' : "Congo, Dem. Rep.",
"Egypt" : "Egypt, Arab Rep.",
'Gambia' : "Gambia, The",
'Hong Kong S.A.R. of China' : "Hong Kong SAR, China",
'Iran' : "Iran, Islamic Rep.",
"Ivory Coast" :"Cote d'Ivoire",
'Kyrgyzstan' :"Kyrgyz Republic",
"Laos" : "Lao PDR",
'North Cyprus' : 'Cyprus',
"Russia":"Russian Federation",
"Slovakia":"Slovak Republic",
'Somaliland region' :"Somalia",
'South Korea':"Korea, Rep.",
"Syria" : "Syrian Arab Republic",
"Venezuela" :"Venezuela, RB",
"Swaziland":"Eswatini",
"Yemen" : "Yemen, Rep."}
new_col = []
country_cannot_be_matched = []
for i in range(len(col_country_name)):
    country = col_country_name[i]
    try:
        if country in rematching_dict:
            # Translate the name into the one used in df_age_ratio
            country = rematching_dict[country]
        new_col.append(df_age_ratio[df_age_ratio["Country Name"].isin([country])][str(col_year[i])].iloc[0])
    except (IndexError, KeyError):
        country_cannot_be_matched.append(col_country_name[i])
        new_col.append("None")
set(country_cannot_be_matched)
# ignored; these countries will be deleted later
{'Palestinian Territories', 'Taiwan Province of China'}
df_happiness.insert(len(df_happiness.columns), 'age_ratio', new_col)
df_happiness
| | Country name | year | Life Ladder | Log GDP per capita | Social support | Healthy life expectancy at birth | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect | age_ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 2008 | 3.724 | 7.370 | 0.451 | 50.80 | 0.718 | 0.168 | 0.882 | 0.518 | 0.258 | 102.078659 |
| 1 | Afghanistan | 2009 | 4.402 | 7.540 | 0.552 | 51.20 | 0.679 | 0.190 | 0.850 | 0.584 | 0.237 | 102.249014 |
| 2 | Afghanistan | 2010 | 4.758 | 7.647 | 0.539 | 51.60 | 0.600 | 0.121 | 0.707 | 0.618 | 0.275 | 102.045823 |
| 3 | Afghanistan | 2011 | 3.832 | 7.620 | 0.521 | 51.92 | 0.496 | 0.162 | 0.731 | 0.611 | 0.267 | 100.224461 |
| 4 | Afghanistan | 2012 | 3.783 | 7.705 | 0.521 | 52.24 | 0.531 | 0.236 | 0.776 | 0.710 | 0.268 | 97.925947 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1944 | Zimbabwe | 2016 | 3.735 | 7.984 | 0.768 | 54.40 | 0.733 | -0.095 | 0.724 | 0.738 | 0.209 | 83.576729 |
| 1945 | Zimbabwe | 2017 | 3.638 | 8.016 | 0.754 | 55.00 | 0.753 | -0.098 | 0.751 | 0.806 | 0.224 | 83.466245 |
| 1946 | Zimbabwe | 2018 | 3.616 | 8.049 | 0.775 | 55.60 | 0.763 | -0.068 | 0.844 | 0.710 | 0.212 | 82.951113 |
| 1947 | Zimbabwe | 2019 | 2.694 | 7.950 | 0.759 | 56.20 | 0.632 | -0.064 | 0.831 | 0.716 | 0.235 | 82.277964 |
| 1948 | Zimbabwe | 2020 | 3.160 | 7.829 | 0.717 | 56.80 | 0.643 | -0.009 | 0.789 | 0.703 | 0.346 | 81.571496 |
1949 rows × 12 columns
# save to directory
df_happiness.to_csv("hapi_age.csv",index=False)
The next integration step was to update df_happiness up to year 2021. From the data source, we found that there were two CSV files: the first contained the year 2021 data, whilst the second covered year 2005 to year 2020. Therefore, we updated our dataset by integrating df_happiness_2021 into df_happiness.
df_happiness_2021 = pd.read_csv('world-happiness-report-2021.csv')
df_happiness = pd.read_csv('hapi_age.csv')
Here, we used the inner join (intersection) concept to combine df_happiness_2021 and df_happiness by matching the columns that exist in df_happiness. To do this, we needed to rename some columns and add new columns in df_happiness_2021 so that the columns match.
*A left join might also be possible.
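As a tiny illustration (with throwaway frames) of what concat with join='inner' does to the columns:

# join='inner' keeps only the columns common to both frames
a = pd.DataFrame({"x": [1], "y": [2]})
b = pd.DataFrame({"x": [3], "z": [4]})
print(pd.concat([a, b], join="inner", ignore_index=True))  # only column 'x' survives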
# rename the 2021 columns to match the names used in the earlier data
df_happiness_2021.rename(columns = {'Ladder score':'Life Ladder', 'Logged GDP per capita':'Log GDP per capita','Healthy life expectancy':'Healthy life expectancy at birth'}, inplace = True)
# add the columns missing from the 2021 file (empty placeholders) plus the survey year
df_happiness_2021.insert(len(df_happiness.columns), 'age_ratio', '')
df_happiness_2021.insert(len(df_happiness.columns), 'Positive affect', '')
df_happiness_2021.insert(len(df_happiness.columns), 'Negative affect', '')
df_happiness_2021.insert(len(df_happiness.columns), 'year', 2021)
all_col_hapiness_2021 = df_happiness_2021.columns
all_col_hapiness_all = df_happiness.columns
intersect_col = all_col_hapiness_all.intersection(all_col_hapiness_2021)
print(intersect_col,all_col_hapiness_all)
assert(intersect_col.equals(all_col_hapiness_all))
Index(['Country name', 'year', 'Life Ladder', 'Log GDP per capita',
'Social support', 'Healthy life expectancy at birth',
'Freedom to make life choices', 'Generosity',
'Perceptions of corruption', 'Positive affect', 'Negative affect',
'age_ratio'],
dtype='object') Index(['Country name', 'year', 'Life Ladder', 'Log GDP per capita',
'Social support', 'Healthy life expectancy at birth',
'Freedom to make life choices', 'Generosity',
'Perceptions of corruption', 'Positive affect', 'Negative affect',
'age_ratio'],
dtype='object')
new_df = pd.concat([df_happiness, df_happiness_2021],join='inner', ignore_index=True)
sort_df = new_df.sort_values(by = 'Country name')
assert(len(sort_df.index)==(len(df_happiness_2021.index)+len(df_happiness.index)))
sort_df
| | Country name | year | Life Ladder | Log GDP per capita | Social support | Healthy life expectancy at birth | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect | age_ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 2008 | 3.724 | 7.370 | 0.451 | 50.80 | 0.718 | 0.168 | 0.882 | 0.518 | 0.258 | 102.078658552986 |
| 11 | Afghanistan | 2019 | 2.375 | 7.697 | 0.420 | 52.40 | 0.394 | -0.108 | 0.924 | 0.351 | 0.502 | 82.1097716001822 |
| 10 | Afghanistan | 2018 | 2.694 | 7.692 | 0.508 | 52.60 | 0.374 | -0.094 | 0.928 | 0.424 | 0.405 | 84.0776554602003 |
| 9 | Afghanistan | 2017 | 2.662 | 7.697 | 0.491 | 52.80 | 0.427 | -0.121 | 0.954 | 0.496 | 0.371 | 86.0007546392816 |
| 8 | Afghanistan | 2016 | 4.220 | 7.697 | 0.559 | 53.00 | 0.523 | 0.042 | 0.793 | 0.565 | 0.348 | 87.9417880743028 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1936 | Zimbabwe | 2008 | 3.174 | 7.461 | 0.843 | 44.14 | 0.344 | -0.090 | 0.964 | 0.631 | 0.25 | 79.9062332058443 |
| 1935 | Zimbabwe | 2007 | 3.280 | 7.666 | 0.828 | 42.86 | 0.456 | -0.082 | 0.946 | 0.661 | 0.265 | 79.6756171295196 |
| 1934 | Zimbabwe | 2006 | 3.826 | 7.711 | 0.822 | 41.58 | 0.431 | -0.076 | 0.905 | 0.715 | 0.297 | 79.6946129295014 |
| 1942 | Zimbabwe | 2014 | 4.184 | 7.991 | 0.766 | 52.38 | 0.642 | -0.074 | 0.820 | 0.725 | 0.239 | 82.8400413568859 |
| 1941 | Zimbabwe | 2013 | 4.690 | 7.985 | 0.799 | 50.96 | 0.576 | -0.104 | 0.831 | 0.712 | 0.182 | 82.3502519981812 |
2098 rows × 12 columns
sort_df.to_csv("hapi_age_latest.csv",index=False)
We completed our dataset integration upon successful creation of the hapi_age_latest.csv file.
Next, we proceeded to the subsequent data preprocessing tasks, focusing on data cleaning and data reduction.
Step 1: Remove Duplicates
We started by importing the libraries needed and creating a copy of sort_df named df.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
from scipy.stats import pearsonr
df = sort_df.copy()
df
| | Country name | year | Life Ladder | Log GDP per capita | Social support | Healthy life expectancy at birth | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect | age_ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 2008 | 3.724 | 7.370 | 0.451 | 50.80 | 0.718 | 0.168 | 0.882 | 0.518 | 0.258 | 102.078658552986 |
| 11 | Afghanistan | 2019 | 2.375 | 7.697 | 0.420 | 52.40 | 0.394 | -0.108 | 0.924 | 0.351 | 0.502 | 82.1097716001822 |
| 10 | Afghanistan | 2018 | 2.694 | 7.692 | 0.508 | 52.60 | 0.374 | -0.094 | 0.928 | 0.424 | 0.405 | 84.0776554602003 |
| 9 | Afghanistan | 2017 | 2.662 | 7.697 | 0.491 | 52.80 | 0.427 | -0.121 | 0.954 | 0.496 | 0.371 | 86.0007546392816 |
| 8 | Afghanistan | 2016 | 4.220 | 7.697 | 0.559 | 53.00 | 0.523 | 0.042 | 0.793 | 0.565 | 0.348 | 87.9417880743028 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1936 | Zimbabwe | 2008 | 3.174 | 7.461 | 0.843 | 44.14 | 0.344 | -0.090 | 0.964 | 0.631 | 0.25 | 79.9062332058443 |
| 1935 | Zimbabwe | 2007 | 3.280 | 7.666 | 0.828 | 42.86 | 0.456 | -0.082 | 0.946 | 0.661 | 0.265 | 79.6756171295196 |
| 1934 | Zimbabwe | 2006 | 3.826 | 7.711 | 0.822 | 41.58 | 0.431 | -0.076 | 0.905 | 0.715 | 0.297 | 79.6946129295014 |
| 1942 | Zimbabwe | 2014 | 4.184 | 7.991 | 0.766 | 52.38 | 0.642 | -0.074 | 0.820 | 0.725 | 0.239 | 82.8400413568859 |
| 1941 | Zimbabwe | 2013 | 4.690 | 7.985 | 0.799 | 50.96 | 0.576 | -0.104 | 0.831 | 0.712 | 0.182 | 82.3502519981812 |
2098 rows × 12 columns
We used the drop_duplicates() function to return a dataframe with duplicate rows removed. The result still showed all 2098 rows, indicating that the dataset did not contain duplicate rows. We therefore proceeded to change the data types prior to outlier removal.
df_drop = df.drop_duplicates()
print(df_drop)
Country name year Life Ladder Log GDP per capita Social support \
0 Afghanistan 2008 3.724 7.370 0.451
11 Afghanistan 2019 2.375 7.697 0.420
10 Afghanistan 2018 2.694 7.692 0.508
9 Afghanistan 2017 2.662 7.697 0.491
8 Afghanistan 2016 4.220 7.697 0.559
... ... ... ... ... ...
1936 Zimbabwe 2008 3.174 7.461 0.843
1935 Zimbabwe 2007 3.280 7.666 0.828
1934 Zimbabwe 2006 3.826 7.711 0.822
1942 Zimbabwe 2014 4.184 7.991 0.766
1941 Zimbabwe 2013 4.690 7.985 0.799
Healthy life expectancy at birth Freedom to make life choices \
0 50.80 0.718
11 52.40 0.394
10 52.60 0.374
9 52.80 0.427
8 53.00 0.523
... ... ...
1936 44.14 0.344
1935 42.86 0.456
1934 41.58 0.431
1942 52.38 0.642
1941 50.96 0.576
Generosity Perceptions of corruption Positive affect Negative affect \
0 0.168 0.882 0.518 0.258
11 -0.108 0.924 0.351 0.502
10 -0.094 0.928 0.424 0.405
9 -0.121 0.954 0.496 0.371
8 0.042 0.793 0.565 0.348
... ... ... ... ...
1936 -0.090 0.964 0.631 0.25
1935 -0.082 0.946 0.661 0.265
1934 -0.076 0.905 0.715 0.297
1942 -0.074 0.820 0.725 0.239
1941 -0.104 0.831 0.712 0.182
age_ratio
0 102.078658552986
11 82.1097716001822
10 84.0776554602003
9 86.0007546392816
8 87.9417880743028
... ...
1936 79.9062332058443
1935 79.6756171295196
1934 79.6946129295014
1942 82.8400413568859
1941 82.3502519981812
[2098 rows x 12 columns]
Step 2: Change of Data Type
We used df.info() to check a concise summary of our data frame.
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2098 entries, 0 to 1941
Data columns (total 12 columns):
 #   Column                            Non-Null Count  Dtype
---  ------                            --------------  -----
 0   Country name                      2098 non-null   object
 1   year                              2098 non-null   int64
 2   Life Ladder                       2098 non-null   float64
 3   Log GDP per capita                2062 non-null   float64
 4   Social support                    2085 non-null   float64
 5   Healthy life expectancy at birth  2043 non-null   float64
 6   Freedom to make life choices      2066 non-null   float64
 7   Generosity                        2009 non-null   float64
 8   Perceptions of corruption         1988 non-null   float64
 9   Positive affect                   2076 non-null   object
 10  Negative affect                   2082 non-null   object
 11  age_ratio                         2084 non-null   object
dtypes: float64(7), int64(1), object(4)
memory usage: 213.1+ KB
It is expected for the 'Country name' attribute to have the 'object' data type, so no type change is needed there.
However, 'Positive affect', 'Negative affect' and 'age_ratio' also show as 'object' and would require further checking to see whether any abnormal values are present.
We therefore used the value_counts() function to get the counts of unique values in these attributes.
df['Positive affect'].value_counts()
149
0.833 12
0.784 12
0.832 12
0.82 12
...
0.72 1
0.465 1
0.895 1
0.885 1
0.427 1
Name: Positive affect, Length: 432, dtype: int64
df['Negative affect'].value_counts()
149
0.206 17
0.232 16
0.243 15
0.26 14
...
0.496 1
0.14 1
0.512 1
0.544 1
0.442 1
Name: Negative affect, Length: 375, dtype: int64
df.age_ratio.value_counts()
149
None 27
41.5149338958792 2
41.9915591016913 2
42.850097415439 2
...
85.2903115632256 1
50.1437637425205 1
36.2316084308117 1
47.0200238173727 1
41.4821233095685 1
Name: age_ratio, Length: 1903, dtype: int64
As we can see, 'age_ratio' contains non-numeric data (the unlabeled count of 149 in each output above corresponds to the blank placeholder values from the 2021 rows).
To investigate further, we retrieved the rows whose 'age_ratio' holds the string 'None'.
df_2 = df.loc[df['age_ratio'] == 'None']
df_2
| | Country name | year | Life Ladder | Log GDP per capita | Social support | Healthy life expectancy at birth | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect | age_ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1338 | Palestinian Territories | 2019 | 4.483 | NaN | 0.833 | NaN | 0.653 | NaN | 0.829 | 0.625 | 0.4 | None |
| 1328 | Palestinian Territories | 2009 | 4.470 | 8.329 | 0.738 | 62.132 | 0.468 | -0.085 | 0.797 | 0.544 | 0.466 | None |
| 1326 | Palestinian Territories | 2007 | 4.151 | 8.218 | 0.712 | 61.897 | 0.365 | -0.080 | 0.844 | 0.566 | 0.412 | None |
| 1337 | Palestinian Territories | 2018 | 4.554 | NaN | 0.819 | NaN | 0.655 | NaN | 0.814 | 0.61 | 0.419 | None |
| 1329 | Palestinian Territories | 2010 | 4.703 | 8.383 | 0.822 | 62.250 | 0.504 | -0.117 | 0.752 | 0.628 | 0.381 | None |
| 1336 | Palestinian Territories | 2017 | 4.628 | 8.485 | 0.824 | NaN | 0.632 | -0.163 | 0.831 | 0.597 | 0.416 | None |
| 1331 | Palestinian Territories | 2012 | 4.647 | 8.531 | 0.782 | NaN | 0.542 | -0.153 | 0.730 | 0.616 | 0.379 | None |
| 1334 | Palestinian Territories | 2015 | 4.695 | 8.480 | 0.766 | NaN | 0.556 | -0.153 | 0.774 | 0.594 | 0.369 | None |
| 1333 | Palestinian Territories | 2014 | 4.722 | 8.457 | 0.775 | NaN | 0.657 | -0.147 | 0.804 | 0.565 | 0.38 | None |
| 1332 | Palestinian Territories | 2013 | 4.844 | 8.489 | 0.761 | NaN | 0.454 | -0.150 | 0.780 | 0.594 | 0.365 | None |
| 1330 | Palestinian Territories | 2011 | 4.751 | 8.474 | 0.751 | NaN | 0.522 | -0.127 | 0.750 | 0.567 | 0.388 | None |
| 1325 | Palestinian Territories | 2006 | 4.716 | 8.213 | 0.818 | 61.780 | 0.547 | NaN | 0.858 | 0.497 | 0.431 | None |
| 1335 | Palestinian Territories | 2016 | 4.907 | 8.498 | 0.818 | NaN | 0.608 | -0.129 | 0.812 | 0.593 | 0.378 | None |
| 1327 | Palestinian Territories | 2008 | 4.386 | 8.276 | 0.666 | 62.015 | 0.358 | -0.070 | 0.753 | 0.571 | 0.403 | None |
| 1681 | Taiwan Province of China | 2020 | 6.751 | NaN | 0.901 | NaN | 0.799 | NaN | 0.711 | 0.845 | 0.083 | None |
| 1680 | Taiwan Province of China | 2019 | 6.537 | NaN | 0.893 | NaN | 0.814 | NaN | 0.718 | 0.86 | 0.093 | None |
| 1679 | Taiwan Province of China | 2018 | 6.467 | NaN | 0.896 | NaN | 0.741 | NaN | 0.736 | 0.848 | 0.093 | None |
| 1678 | Taiwan Province of China | 2017 | 6.359 | 10.871 | 0.891 | NaN | 0.760 | -0.070 | 0.743 | 0.837 | 0.114 | None |
| 1676 | Taiwan Province of China | 2015 | 6.450 | 10.842 | 0.885 | NaN | 0.701 | 0.019 | 0.857 | 0.832 | 0.129 | None |
| 1677 | Taiwan Province of China | 2016 | 6.513 | 10.855 | 0.895 | NaN | 0.719 | -0.049 | 0.811 | 0.833 | 0.108 | None |
| 1674 | Taiwan Province of China | 2013 | 6.340 | 10.750 | 0.817 | NaN | 0.690 | 0.002 | 0.841 | 0.846 | 0.124 | None |
| 1673 | Taiwan Province of China | 2012 | 6.126 | 10.716 | 0.825 | NaN | 0.698 | 0.022 | 0.803 | 0.821 | 0.14 | None |
| 1672 | Taiwan Province of China | 2011 | 6.309 | 10.705 | 0.863 | NaN | 0.761 | 0.035 | 0.755 | 0.827 | 0.112 | None |
| 1671 | Taiwan Province of China | 2010 | 6.229 | 10.691 | 0.831 | 69.600 | 0.677 | 0.005 | 0.821 | 0.845 | 0.136 | None |
| 1670 | Taiwan Province of China | 2008 | 5.548 | 10.606 | 0.830 | 69.140 | 0.642 | -0.017 | 0.785 | 0.794 | 0.169 | None |
| 1669 | Taiwan Province of China | 2006 | 6.189 | 10.613 | 0.882 | 68.680 | 0.630 | -0.030 | 0.846 | 0.814 | 0.094 | None |
| 1675 | Taiwan Province of China | 2014 | 6.363 | 10.798 | 0.870 | NaN | 0.693 | 0.092 | 0.866 | 0.849 | 0.108 | None |
We noticed that multiple rows had 'None' values in 'age_ratio'.
With the help of np.unique(), we identified that 'Palestinian Territories' and 'Taiwan Province of China' were the countries with this data issue.
import numpy as np
np.unique(df_2['Country name'])
array(['Palestinian Territories', 'Taiwan Province of China'],
dtype=object)
df.shape
(2098, 12)
df_2.shape
(27, 12)
Since we had 2098 rows of preprocessed data, we decided to remove the records of 'Palestinian Territories' and 'Taiwan Province of China', in view that these countries do not have any 'age_ratio' reference values across multiple years.
# Drop a row by condition
df = df[df['age_ratio'] != 'None']
df.shape
(2071, 12)
Next, we proceeded to change the data type of 'Positive affect', 'Negative affect' and 'age_ratio' to a numeric data type.
# Convert the selected columns to numeric; blank entries become NaN
df['Positive affect'] = pd.to_numeric(df['Positive affect'])
df['Negative affect'] = pd.to_numeric(df['Negative affect'])
df['age_ratio'] = pd.to_numeric(df['age_ratio'])
<ipython-input-30-05b2e41f0fc3>: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
(the warning is emitted once for each of the three assignments above)
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2071 entries, 0 to 1941
Data columns (total 12 columns):
 #   Column                            Non-Null Count  Dtype
---  ------                            --------------  -----
 0   Country name                      2071 non-null   object
 1   year                              2071 non-null   int64
 2   Life Ladder                       2071 non-null   float64
 3   Log GDP per capita                2040 non-null   float64
 4   Social support                    2058 non-null   float64
 5   Healthy life expectancy at birth  2035 non-null   float64
 6   Freedom to make life choices      2039 non-null   float64
 7   Generosity                        1988 non-null   float64
 8   Perceptions of corruption         1961 non-null   float64
 9   Positive affect                   1900 non-null   float64
 10  Negative affect                   1906 non-null   float64
 11  age_ratio                         1908 non-null   float64
dtypes: float64(10), int64(1), object(1)
memory usage: 210.3+ KB
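For reference, the same conversion can be written with an explicit coercion policy. This sketch is equivalent here, since the only non-numeric entries left are blanks:

# Sketch: explicitly coerce anything unparseable to NaN
for col in ['Positive affect', 'Negative affect', 'age_ratio']:
    df[col] = pd.to_numeric(df[col], errors='coerce')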
Step 3: Remove Outliers
Before applying the detection function, we first inspected which variables contained outliers by sketching boxplots.
sns.boxplot(x = df['Life Ladder'])
<AxesSubplot:xlabel='Life Ladder'>
sns.boxplot(x = df['Log GDP per capita'])
<AxesSubplot:xlabel='Log GDP per capita'>
sns.boxplot(x = df['Social support'])
<AxesSubplot:xlabel='Social support'>
sns.boxplot(x = df['Healthy life expectancy at birth'])
<AxesSubplot:xlabel='Healthy life expectancy at birth'>
sns.boxplot(x = df['Freedom to make life choices'])
<AxesSubplot:xlabel='Freedom to make life choices'>
sns.boxplot(x = df['Generosity'])
<AxesSubplot:xlabel='Generosity'>
sns.boxplot(x = df['Perceptions of corruption'])
<AxesSubplot:xlabel='Perceptions of corruption'>
sns.boxplot(x = df['Positive affect'])
<AxesSubplot:xlabel='Positive affect'>
sns.boxplot(x = df['Negative affect'])
<AxesSubplot:xlabel='Negative affect'>
sns.boxplot(x = df['age_ratio'])
<AxesSubplot:xlabel='age_ratio'>
From the boxplots shown above, we could deduce that:
(i) 'Life Ladder' and 'Log GDP per capita' did not contain outliers;
(ii) 'Social support', 'Healthy life expectancy at birth', 'Freedom to make life choices', 'Generosity', 'Perceptions of corruption', 'Positive affect', 'Negative affect' and 'age_ratio' contained outliers.
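As an aside, the ten boxplot cells above could also be drawn in one loop; a sketch assuming the same df:

# Sketch: draw all ten boxplots in a single figure instead of one cell each
numeric_cols = ['Life Ladder', 'Log GDP per capita', 'Social support',
                'Healthy life expectancy at birth', 'Freedom to make life choices',
                'Generosity', 'Perceptions of corruption', 'Positive affect',
                'Negative affect', 'age_ratio']
fig, axes = plt.subplots(5, 2, figsize=(12, 16))
for col, ax in zip(numeric_cols, axes.flatten()):
    sns.boxplot(x=df[col], ax=ax)
plt.tight_layout()
plt.show()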
Next, we used the interquartile rule to detect outliers. Since Python treated 'year' as one of the numerical variables, we dropped the 'year' column at this stage.
df.pop('year')
0 2008
11 2019
10 2018
9 2017
8 2016
...
1936 2008
1935 2007
1934 2006
1942 2014
1941 2013
Name: year, Length: 2071, dtype: int64
We calculated Q1, Q3, and IQR for the df.
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1
print(IQR)
Life Ladder                          1.624500
Log GDP per capita                   1.879250
Social support                       0.158000
Healthy life expectancy at birth     9.960000
Freedom to make life choices         0.207000
Generosity                           0.203250
Perceptions of corruption            0.182000
Positive affect                      0.171000
Negative affect                      0.112000
age_ratio                           20.339253
dtype: float64
After obtaining the IQR of each variable, we computed the lower fence (Q1 - 1.5 * IQR) and upper fence (Q3 + 1.5 * IQR) to find outliers.
print((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR)))
<ipython-input-44-a02a94f243ef>: FutureWarning: Automatic reindexing on DataFrame vs Series comparisons is deprecated and will raise ValueError in a future version. Do `left, right = left.align(right, axis=1, copy=False)` before e.g. `left == right` (emitted once for each of the two comparisons)
Country name Freedom to make life choices Generosity \
0 False False False
11 False False False
10 False False False
9 False False False
8 False False False
... ... ... ...
1936 False False False
1935 False False False
1934 False False False
1942 False False False
1941 False False False
Healthy life expectancy at birth Life Ladder Log GDP per capita \
0 False False False
11 False False False
10 False False False
9 False False False
8 False False False
... ... ... ...
1936 False False False
1935 True False False
1934 True False False
1942 False False False
1941 False False False
Negative affect Perceptions of corruption Positive affect \
0 False False False
11 False False True
10 False False False
9 False False False
8 False False False
... ... ... ...
1936 False False False
1935 False False False
1934 False False False
1942 False False False
1941 False False False
Social support age_ratio
0 True False
11 True False
10 True False
9 True False
8 False False
... ... ...
1936 False False
1935 False False
1934 False False
1942 False False
1941 False False
[2071 rows x 11 columns]
Outliers were detected, and we used the tilde (~) negation operator together with any(axis=1) to drop every row containing at least one outlier. df_out.shape told us how many rows and columns we were left with: 1715 rows and 11 columns.
df_out = df[~((df < (Q1 - 1.5 * IQR)) |(df > (Q3 + 1.5 * IQR))).any(axis=1)]
df_out.shape
#dataset with outliers (to be verified)
#df_OUTLIERS = df[~((df > (Q1 - 1.5 * IQR)) |(df < (Q3 + 1.5 * IQR))).any(axis=1)]
<ipython-input-45-ec09880c2dfd>: FutureWarning: Automatic reindexing on DataFrame vs Series comparisons is deprecated and will raise ValueError in a future version. Do `left, right = left.align(right, axis=1, copy=False)` before e.g. `left == right` (emitted once for each of the two comparisons)
(1715, 11)
df_out
| | Country name | Life Ladder | Log GDP per capita | Social support | Healthy life expectancy at birth | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect | age_ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | Afghanistan | 4.220 | 7.697 | 0.559 | 53.000 | 0.523 | 0.042 | 0.793 | 0.565 | 0.348 | 87.941788 |
| 7 | Afghanistan | 3.983 | 7.702 | 0.529 | 53.200 | 0.389 | 0.080 | 0.881 | 0.554 | 0.339 | 89.954092 |
| 4 | Afghanistan | 3.783 | 7.705 | 0.521 | 52.240 | 0.531 | 0.236 | 0.776 | 0.710 | 0.268 | 97.925947 |
| 6 | Afghanistan | 3.131 | 7.718 | 0.526 | 52.880 | 0.509 | 0.104 | 0.871 | 0.532 | 0.375 | 92.649143 |
| 2041 | Albania | 5.117 | 9.520 | 0.697 | 68.999 | 0.785 | -0.030 | 0.901 | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1938 | Zimbabwe | 4.682 | 7.729 | 0.857 | 46.700 | 0.665 | -0.093 | 0.828 | 0.748 | 0.122 | 80.772063 |
| 1937 | Zimbabwe | 4.056 | 7.563 | 0.806 | 45.420 | 0.411 | -0.078 | 0.931 | 0.736 | 0.218 | 80.287926 |
| 1936 | Zimbabwe | 3.174 | 7.461 | 0.843 | 44.140 | 0.344 | -0.090 | 0.964 | 0.631 | 0.250 | 79.906233 |
| 1942 | Zimbabwe | 4.184 | 7.991 | 0.766 | 52.380 | 0.642 | -0.074 | 0.820 | 0.725 | 0.239 | 82.840041 |
| 1941 | Zimbabwe | 4.690 | 7.985 | 0.799 | 50.960 | 0.576 | -0.104 | 0.831 | 0.712 | 0.182 | 82.350252 |
1715 rows × 11 columns
Lastly, we inserted the 'year' column of sort_df back into our duplicate- and outlier-free dataset, df_out. This completed these data cleaning steps.
sort_df['year']
0 2008
11 2019
10 2018
9 2017
8 2016
...
1936 2008
1935 2007
1934 2006
1942 2014
1941 2013
Name: year, Length: 2098, dtype: int64
df_out.insert(loc=1, column="year", value=sort_df['year'])
df_out
| | Country name | year | Life Ladder | Log GDP per capita | Social support | Healthy life expectancy at birth | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect | age_ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | Afghanistan | 2016 | 4.220 | 7.697 | 0.559 | 53.000 | 0.523 | 0.042 | 0.793 | 0.565 | 0.348 | 87.941788 |
| 7 | Afghanistan | 2015 | 3.983 | 7.702 | 0.529 | 53.200 | 0.389 | 0.080 | 0.881 | 0.554 | 0.339 | 89.954092 |
| 4 | Afghanistan | 2012 | 3.783 | 7.705 | 0.521 | 52.240 | 0.531 | 0.236 | 0.776 | 0.710 | 0.268 | 97.925947 |
| 6 | Afghanistan | 2014 | 3.131 | 7.718 | 0.526 | 52.880 | 0.509 | 0.104 | 0.871 | 0.532 | 0.375 | 92.649143 |
| 2041 | Albania | 2021 | 5.117 | 9.520 | 0.697 | 68.999 | 0.785 | -0.030 | 0.901 | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1938 | Zimbabwe | 2010 | 4.682 | 7.729 | 0.857 | 46.700 | 0.665 | -0.093 | 0.828 | 0.748 | 0.122 | 80.772063 |
| 1937 | Zimbabwe | 2009 | 4.056 | 7.563 | 0.806 | 45.420 | 0.411 | -0.078 | 0.931 | 0.736 | 0.218 | 80.287926 |
| 1936 | Zimbabwe | 2008 | 3.174 | 7.461 | 0.843 | 44.140 | 0.344 | -0.090 | 0.964 | 0.631 | 0.250 | 79.906233 |
| 1942 | Zimbabwe | 2014 | 4.184 | 7.991 | 0.766 | 52.380 | 0.642 | -0.074 | 0.820 | 0.725 | 0.239 | 82.840041 |
| 1941 | Zimbabwe | 2013 | 4.690 | 7.985 | 0.799 | 50.960 | 0.576 | -0.104 | 0.831 | 0.712 | 0.182 | 82.350252 |
1715 rows × 12 columns
Step 4: Check Redundancy
Furthermore, we used Pearson correlation and a heatmap to check for redundancy among variables. The correlation coefficients ranged from -0.73 to 0.81. After examining the variable pairs with high correlation coefficients, we judged each of the involved variables, such as 'Life Ladder', 'Log GDP per capita', 'Healthy life expectancy at birth', and 'age_ratio', to be a significant feature of the dataset, and therefore kept all variables.
heat = df_out.iloc[:, :]
corr_mat_heat = heat.corr().round(2)
print(corr_mat_heat)
plt.figure(figsize = (10,8))
plot = sns.heatmap(corr_mat_heat)  # reuse the rounded correlation matrix computed above
year Life Ladder Log GDP per capita \
year 1.00 0.01 0.05
Life Ladder 0.01 1.00 0.73
Log GDP per capita 0.05 0.73 1.00
Social support -0.04 0.66 0.65
Healthy life expectancy at birth 0.14 0.70 0.81
Freedom to make life choices 0.24 0.45 0.25
Generosity -0.05 0.14 -0.11
Perceptions of corruption -0.11 -0.31 -0.18
Positive affect -0.03 0.47 0.20
Negative affect 0.21 -0.17 -0.10
age_ratio -0.02 -0.50 -0.73
Social support \
year -0.04
Life Ladder 0.66
Log GDP per capita 0.65
Social support 1.00
Healthy life expectancy at birth 0.56
Freedom to make life choices 0.33
Generosity 0.02
Perceptions of corruption -0.13
Positive affect 0.35
Negative affect -0.33
age_ratio -0.44
Healthy life expectancy at birth \
year 0.14
Life Ladder 0.70
Log GDP per capita 0.81
Social support 0.56
Healthy life expectancy at birth 1.00
Freedom to make life choices 0.29
Generosity -0.04
Perceptions of corruption -0.20
Positive affect 0.24
Negative affect -0.03
age_ratio -0.71
Freedom to make life choices Generosity \
year 0.24 -0.05
Life Ladder 0.45 0.14
Log GDP per capita 0.25 -0.11
Social support 0.33 0.02
Healthy life expectancy at birth 0.29 -0.04
Freedom to make life choices 1.00 0.29
Generosity 0.29 1.00
Perceptions of corruption -0.38 -0.21
Positive affect 0.56 0.32
Negative affect -0.16 -0.07
age_ratio -0.19 0.12
Perceptions of corruption Positive affect \
year -0.11 -0.03
Life Ladder -0.31 0.47
Log GDP per capita -0.18 0.20
Social support -0.13 0.35
Healthy life expectancy at birth -0.20 0.24
Freedom to make life choices -0.38 0.56
Generosity -0.21 0.32
Perceptions of corruption 1.00 -0.23
Positive affect -0.23 1.00
Negative affect 0.14 -0.27
age_ratio 0.02 -0.07
Negative affect age_ratio
year 0.21 -0.02
Life Ladder -0.17 -0.50
Log GDP per capita -0.10 -0.73
Social support -0.33 -0.44
Healthy life expectancy at birth -0.03 -0.71
Freedom to make life choices -0.16 -0.19
Generosity -0.07 0.12
Perceptions of corruption 0.14 0.02
Positive affect -0.27 -0.07
Negative affect 1.00 0.08
age_ratio 0.08 1.00
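To double-check this reading, a small sketch that lists the variable pairs with an absolute correlation of at least 0.7 from the matrix above:

# Sketch: extract the highly correlated pairs (|r| >= 0.7) from corr_mat_heat
corr_abs = corr_mat_heat.abs()
upper = np.triu(np.ones(corr_abs.shape, dtype=bool), k=1)  # upper triangle, excluding the diagonal
pairs = corr_abs.where(upper).stack().sort_values(ascending=False)
print(pairs[pairs >= 0.7])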
Step 5: Data Cleaning
We created another copy named 'cleaning_wip_df' to avoid introducing dirty data into the original 'df_out' data frame.
cleaning_wip_df = df_out.copy()  # an explicit copy, so edits here cannot touch df_out
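Why .copy() matters, as a throwaway illustration: a plain assignment only creates a second name for the same object, so edits through either name change the one underlying frame.

# Plain assignment aliases; .copy() creates an independent DataFrame
a = pd.DataFrame({"v": [1, 2]})
b = a                 # alias: b and a are the same object
b.loc[0, "v"] = 99
print(a.loc[0, "v"])  # 99 -> the "original" changed too
c = a.copy()          # c is independent of a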
Null Value Handling
We verified that our dataset contained null values.
# to check if there is any Null value in merged data set
cleaning_wip_df.isnull().values.any()
True
There was a total of 683 null values in our data set to be handled.
# calculate the total number of null values
print('Count of NaN = ' + str(cleaning_wip_df.isnull().sum().sum()))
Count of NaN = 683
The total count of null values for each attribute was summarised as:
# determine the total number of Null data per column
cleaning_wip_df.isnull().sum()
Country name                          0
year                                  0
Life Ladder                           0
Log GDP per capita                   19
Social support                       10
Healthy life expectancy at birth     24
Freedom to make life choices         28
Generosity                           64
Perceptions of corruption           101
Positive affect                     149
Negative affect                     144
age_ratio                           144
dtype: int64
We rearranged our data by the 'Country name' and 'year' attributes in ascending order before proceeding to null value handling.
# print the top 25 rows of the data frame sorted by Country name and year
cleaning_wip_df = cleaning_wip_df.sort_values(by =['Country name', 'year'])
cleaning_wip_df.head(25)
| | Country name | year | Life Ladder | Log GDP per capita | Social support | Healthy life expectancy at birth | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect | age_ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | Afghanistan | 2012 | 3.783 | 7.705 | 0.521 | 52.240 | 0.531 | 0.236 | 0.776 | 0.710 | 0.268 | 97.925947 |
| 6 | Afghanistan | 2014 | 3.131 | 7.718 | 0.526 | 52.880 | 0.509 | 0.104 | 0.871 | 0.532 | 0.375 | 92.649143 |
| 7 | Afghanistan | 2015 | 3.983 | 7.702 | 0.529 | 53.200 | 0.389 | 0.080 | 0.881 | 0.554 | 0.339 | 89.954092 |
| 8 | Afghanistan | 2016 | 4.220 | 7.697 | 0.559 | 53.000 | 0.523 | 0.042 | 0.793 | 0.565 | 0.348 | 87.941788 |
| 12 | Albania | 2007 | 4.634 | 9.142 | 0.821 | 65.800 | 0.529 | -0.009 | 0.875 | 0.553 | 0.246 | 51.604342 |
| 13 | Albania | 2009 | 5.485 | 9.262 | 0.833 | 66.200 | 0.525 | -0.158 | 0.864 | 0.640 | 0.279 | 50.044078 |
| 14 | Albania | 2010 | 5.269 | 9.303 | 0.733 | 66.400 | 0.569 | -0.172 | 0.726 | 0.648 | 0.300 | 49.477909 |
| 15 | Albania | 2011 | 5.867 | 9.331 | 0.759 | 66.680 | 0.487 | -0.205 | 0.877 | 0.628 | 0.257 | 48.118058 |
| 16 | Albania | 2012 | 5.510 | 9.347 | 0.785 | 66.960 | 0.602 | -0.169 | 0.848 | 0.607 | 0.271 | 47.033080 |
| 17 | Albania | 2013 | 4.551 | 9.359 | 0.759 | 67.240 | 0.632 | -0.127 | 0.863 | 0.634 | 0.338 | 46.256656 |
| 18 | Albania | 2014 | 4.814 | 9.378 | 0.626 | 67.520 | 0.735 | -0.025 | 0.883 | 0.685 | 0.335 | 45.774680 |
| 19 | Albania | 2015 | 4.607 | 9.403 | 0.639 | 67.800 | 0.704 | -0.081 | 0.885 | 0.688 | 0.350 | 45.550402 |
| 20 | Albania | 2016 | 4.511 | 9.437 | 0.638 | 68.100 | 0.730 | -0.017 | 0.901 | 0.675 | 0.322 | 45.645034 |
| 21 | Albania | 2017 | 4.640 | 9.476 | 0.638 | 68.400 | 0.750 | -0.029 | 0.876 | 0.669 | 0.334 | 45.682097 |
| 22 | Albania | 2018 | 5.004 | 9.518 | 0.684 | 68.700 | 0.824 | 0.009 | 0.899 | 0.713 | 0.319 | 45.810037 |
| 23 | Albania | 2019 | 4.995 | 9.544 | 0.686 | 69.000 | 0.777 | -0.099 | 0.914 | 0.681 | 0.274 | 46.203522 |
| 24 | Albania | 2020 | 5.365 | 9.497 | 0.710 | 69.300 | 0.754 | 0.007 | 0.891 | 0.679 | 0.265 | 46.930147 |
| 2041 | Albania | 2021 | 5.117 | 9.520 | 0.697 | 68.999 | 0.785 | -0.030 | 0.901 | NaN | NaN | NaN |
| 25 | Algeria | 2010 | 5.464 | 9.287 | NaN | 64.500 | 0.593 | -0.205 | 0.618 | NaN | NaN | 48.674537 |
| 26 | Algeria | 2011 | 5.317 | 9.297 | 0.810 | 64.660 | 0.530 | -0.181 | 0.638 | 0.550 | 0.255 | 49.213095 |
| 27 | Algeria | 2012 | 5.605 | 9.311 | 0.839 | 64.820 | 0.587 | -0.172 | 0.690 | 0.604 | 0.230 | 49.778219 |
| 28 | Algeria | 2014 | 6.355 | 9.335 | 0.818 | 65.140 | NaN | NaN | NaN | 0.626 | 0.177 | 51.509189 |
| 29 | Algeria | 2016 | 5.341 | 9.362 | 0.749 | 65.500 | NaN | NaN | NaN | 0.661 | 0.377 | 54.184014 |
| 30 | Algeria | 2017 | 5.249 | 9.354 | 0.807 | 65.700 | 0.437 | -0.167 | 0.700 | 0.642 | 0.289 | 55.804001 |
| 31 | Algeria | 2018 | 5.043 | 9.348 | 0.799 | 65.900 | 0.583 | -0.146 | 0.759 | 0.591 | 0.293 | 57.508033 |
We used linear interpolation to estimate the missing values of the numeric attributes below:
# replace NaN values with the linear interpolation method for the numeric columns
for col in ['Log GDP per capita', 'Social support', 'Healthy life expectancy at birth',
            'Freedom to make life choices', 'Generosity', 'Perceptions of corruption',
            'Positive affect', 'Negative affect']:
    cleaning_wip_df[col] = cleaning_wip_df[col].interpolate(method='linear')
cleaning_wip_df.head(30)
| | Country name | year | Life Ladder | Log GDP per capita | Social support | Healthy life expectancy at birth | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect | age_ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | Afghanistan | 2012 | 3.783 | 7.705 | 0.5210 | 52.240 | 0.531 | 0.236000 | 0.776000 | 0.710 | 0.268000 | 97.925947 |
| 6 | Afghanistan | 2014 | 3.131 | 7.718 | 0.5260 | 52.880 | 0.509 | 0.104000 | 0.871000 | 0.532 | 0.375000 | 92.649143 |
| 7 | Afghanistan | 2015 | 3.983 | 7.702 | 0.5290 | 53.200 | 0.389 | 0.080000 | 0.881000 | 0.554 | 0.339000 | 89.954092 |
| 8 | Afghanistan | 2016 | 4.220 | 7.697 | 0.5590 | 53.000 | 0.523 | 0.042000 | 0.793000 | 0.565 | 0.348000 | 87.941788 |
| 12 | Albania | 2007 | 4.634 | 9.142 | 0.8210 | 65.800 | 0.529 | -0.009000 | 0.875000 | 0.553 | 0.246000 | 51.604342 |
| 13 | Albania | 2009 | 5.485 | 9.262 | 0.8330 | 66.200 | 0.525 | -0.158000 | 0.864000 | 0.640 | 0.279000 | 50.044078 |
| 14 | Albania | 2010 | 5.269 | 9.303 | 0.7330 | 66.400 | 0.569 | -0.172000 | 0.726000 | 0.648 | 0.300000 | 49.477909 |
| 15 | Albania | 2011 | 5.867 | 9.331 | 0.7590 | 66.680 | 0.487 | -0.205000 | 0.877000 | 0.628 | 0.257000 | 48.118058 |
| 16 | Albania | 2012 | 5.510 | 9.347 | 0.7850 | 66.960 | 0.602 | -0.169000 | 0.848000 | 0.607 | 0.271000 | 47.033080 |
| 17 | Albania | 2013 | 4.551 | 9.359 | 0.7590 | 67.240 | 0.632 | -0.127000 | 0.863000 | 0.634 | 0.338000 | 46.256656 |
| 18 | Albania | 2014 | 4.814 | 9.378 | 0.6260 | 67.520 | 0.735 | -0.025000 | 0.883000 | 0.685 | 0.335000 | 45.774680 |
| 19 | Albania | 2015 | 4.607 | 9.403 | 0.6390 | 67.800 | 0.704 | -0.081000 | 0.885000 | 0.688 | 0.350000 | 45.550402 |
| 20 | Albania | 2016 | 4.511 | 9.437 | 0.6380 | 68.100 | 0.730 | -0.017000 | 0.901000 | 0.675 | 0.322000 | 45.645034 |
| 21 | Albania | 2017 | 4.640 | 9.476 | 0.6380 | 68.400 | 0.750 | -0.029000 | 0.876000 | 0.669 | 0.334000 | 45.682097 |
| 22 | Albania | 2018 | 5.004 | 9.518 | 0.6840 | 68.700 | 0.824 | 0.009000 | 0.899000 | 0.713 | 0.319000 | 45.810037 |
| 23 | Albania | 2019 | 4.995 | 9.544 | 0.6860 | 69.000 | 0.777 | -0.099000 | 0.914000 | 0.681 | 0.274000 | 46.203522 |
| 24 | Albania | 2020 | 5.365 | 9.497 | 0.7100 | 69.300 | 0.754 | 0.007000 | 0.891000 | 0.679 | 0.265000 | 46.930147 |
| 2041 | Albania | 2021 | 5.117 | 9.520 | 0.6970 | 68.999 | 0.785 | -0.030000 | 0.901000 | 0.636 | 0.261667 | NaN |
| 25 | Algeria | 2010 | 5.464 | 9.287 | 0.7535 | 64.500 | 0.593 | -0.205000 | 0.618000 | 0.593 | 0.258333 | 48.674537 |
| 26 | Algeria | 2011 | 5.317 | 9.297 | 0.8100 | 64.660 | 0.530 | -0.181000 | 0.638000 | 0.550 | 0.255000 | 49.213095 |
| 27 | Algeria | 2012 | 5.605 | 9.311 | 0.8390 | 64.820 | 0.587 | -0.172000 | 0.690000 | 0.604 | 0.230000 | 49.778219 |
| 28 | Algeria | 2014 | 6.355 | 9.335 | 0.8180 | 65.140 | 0.537 | -0.170333 | 0.693333 | 0.626 | 0.177000 | 51.509189 |
| 29 | Algeria | 2016 | 5.341 | 9.362 | 0.7490 | 65.500 | 0.487 | -0.168667 | 0.696667 | 0.661 | 0.377000 | 54.184014 |
| 30 | Algeria | 2017 | 5.249 | 9.354 | 0.8070 | 65.700 | 0.437 | -0.167000 | 0.700000 | 0.642 | 0.289000 | 55.804001 |
| 31 | Algeria | 2018 | 5.043 | 9.348 | 0.7990 | 65.900 | 0.583 | -0.146000 | 0.759000 | 0.591 | 0.293000 | 57.508033 |
| 32 | Algeria | 2019 | 4.745 | 9.337 | 0.8030 | 66.100 | 0.385 | 0.005000 | 0.741000 | 0.585 | 0.215000 | 58.990490 |
| 2057 | Algeria | 2021 | 4.887 | 9.342 | 0.8020 | 66.005 | 0.480 | -0.067000 | 0.752000 | 0.622 | 0.288000 | NaN |
| 33 | Angola | 2011 | 5.589 | 8.946 | 0.7230 | 52.500 | 0.584 | 0.055000 | 0.911000 | 0.659 | 0.361000 | 97.988307 |
| 34 | Angola | 2012 | 4.360 | 8.992 | 0.7530 | 53.200 | 0.456 | -0.136000 | 0.906000 | 0.558 | 0.305000 | 98.145447 |
| 35 | Angola | 2013 | 3.937 | 9.005 | 0.7220 | 53.900 | 0.410 | -0.104000 | 0.816000 | 0.658 | 0.371000 | 98.130463 |
Linear interpolation was not applied to 'age_ratio' because its values are country-dependent.
Thus, we decided to fill in the NaN 'age_ratio' values with the calculated mean of the respective country.
# fill each missing age_ratio with that country's mean age_ratio
cleaning_wip_df['age_ratio'] = cleaning_wip_df['age_ratio'].fillna(cleaning_wip_df.groupby('Country name')['age_ratio'].transform('mean'))
cleaning_wip_df.head(30)
| | Country name | year | Life Ladder | Log GDP per capita | Social support | Healthy life expectancy at birth | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect | age_ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | Afghanistan | 2012 | 3.783 | 7.705 | 0.5210 | 52.240 | 0.531 | 0.236000 | 0.776000 | 0.710 | 0.268000 | 97.925947 |
| 6 | Afghanistan | 2014 | 3.131 | 7.718 | 0.5260 | 52.880 | 0.509 | 0.104000 | 0.871000 | 0.532 | 0.375000 | 92.649143 |
| 7 | Afghanistan | 2015 | 3.983 | 7.702 | 0.5290 | 53.200 | 0.389 | 0.080000 | 0.881000 | 0.554 | 0.339000 | 89.954092 |
| 8 | Afghanistan | 2016 | 4.220 | 7.697 | 0.5590 | 53.000 | 0.523 | 0.042000 | 0.793000 | 0.565 | 0.348000 | 87.941788 |
| 12 | Albania | 2007 | 4.634 | 9.142 | 0.8210 | 65.800 | 0.529 | -0.009000 | 0.875000 | 0.553 | 0.246000 | 51.604342 |
| 13 | Albania | 2009 | 5.485 | 9.262 | 0.8330 | 66.200 | 0.525 | -0.158000 | 0.864000 | 0.640 | 0.279000 | 50.044078 |
| 14 | Albania | 2010 | 5.269 | 9.303 | 0.7330 | 66.400 | 0.569 | -0.172000 | 0.726000 | 0.648 | 0.300000 | 49.477909 |
| 15 | Albania | 2011 | 5.867 | 9.331 | 0.7590 | 66.680 | 0.487 | -0.205000 | 0.877000 | 0.628 | 0.257000 | 48.118058 |
| 16 | Albania | 2012 | 5.510 | 9.347 | 0.7850 | 66.960 | 0.602 | -0.169000 | 0.848000 | 0.607 | 0.271000 | 47.033080 |
| 17 | Albania | 2013 | 4.551 | 9.359 | 0.7590 | 67.240 | 0.632 | -0.127000 | 0.863000 | 0.634 | 0.338000 | 46.256656 |
| 18 | Albania | 2014 | 4.814 | 9.378 | 0.6260 | 67.520 | 0.735 | -0.025000 | 0.883000 | 0.685 | 0.335000 | 45.774680 |
| 19 | Albania | 2015 | 4.607 | 9.403 | 0.6390 | 67.800 | 0.704 | -0.081000 | 0.885000 | 0.688 | 0.350000 | 45.550402 |
| 20 | Albania | 2016 | 4.511 | 9.437 | 0.6380 | 68.100 | 0.730 | -0.017000 | 0.901000 | 0.675 | 0.322000 | 45.645034 |
| 21 | Albania | 2017 | 4.640 | 9.476 | 0.6380 | 68.400 | 0.750 | -0.029000 | 0.876000 | 0.669 | 0.334000 | 45.682097 |
| 22 | Albania | 2018 | 5.004 | 9.518 | 0.6840 | 68.700 | 0.824 | 0.009000 | 0.899000 | 0.713 | 0.319000 | 45.810037 |
| 23 | Albania | 2019 | 4.995 | 9.544 | 0.6860 | 69.000 | 0.777 | -0.099000 | 0.914000 | 0.681 | 0.274000 | 46.203522 |
| 24 | Albania | 2020 | 5.365 | 9.497 | 0.7100 | 69.300 | 0.754 | 0.007000 | 0.891000 | 0.679 | 0.265000 | 46.930147 |
| 2041 | Albania | 2021 | 5.117 | 9.520 | 0.6970 | 68.999 | 0.785 | -0.030000 | 0.901000 | 0.636 | 0.261667 | 47.240772 |
| 25 | Algeria | 2010 | 5.464 | 9.287 | 0.7535 | 64.500 | 0.593 | -0.205000 | 0.618000 | 0.593 | 0.258333 | 48.674537 |
| 26 | Algeria | 2011 | 5.317 | 9.297 | 0.8100 | 64.660 | 0.530 | -0.181000 | 0.638000 | 0.550 | 0.255000 | 49.213095 |
| 27 | Algeria | 2012 | 5.605 | 9.311 | 0.8390 | 64.820 | 0.587 | -0.172000 | 0.690000 | 0.604 | 0.230000 | 49.778219 |
| 28 | Algeria | 2014 | 6.355 | 9.335 | 0.8180 | 65.140 | 0.537 | -0.170333 | 0.693333 | 0.626 | 0.177000 | 51.509189 |
| 29 | Algeria | 2016 | 5.341 | 9.362 | 0.7490 | 65.500 | 0.487 | -0.168667 | 0.696667 | 0.661 | 0.377000 | 54.184014 |
| 30 | Algeria | 2017 | 5.249 | 9.354 | 0.8070 | 65.700 | 0.437 | -0.167000 | 0.700000 | 0.642 | 0.289000 | 55.804001 |
| 31 | Algeria | 2018 | 5.043 | 9.348 | 0.7990 | 65.900 | 0.583 | -0.146000 | 0.759000 | 0.591 | 0.293000 | 57.508033 |
| 32 | Algeria | 2019 | 4.745 | 9.337 | 0.8030 | 66.100 | 0.385 | 0.005000 | 0.741000 | 0.585 | 0.215000 | 58.990490 |
| 2057 | Algeria | 2021 | 4.887 | 9.342 | 0.8020 | 66.005 | 0.480 | -0.067000 | 0.752000 | 0.622 | 0.288000 | 53.207697 |
| 33 | Angola | 2011 | 5.589 | 8.946 | 0.7230 | 52.500 | 0.584 | 0.055000 | 0.911000 | 0.659 | 0.361000 | 97.988307 |
| 34 | Angola | 2012 | 4.360 | 8.992 | 0.7530 | 53.200 | 0.456 | -0.136000 | 0.906000 | 0.558 | 0.305000 | 98.145447 |
| 35 | Angola | 2013 | 3.937 | 9.005 | 0.7220 | 53.900 | 0.410 | -0.104000 | 0.816000 | 0.658 | 0.371000 | 98.130463 |
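For reference, a mini-example (toy data, not our dataset) of the groupby/transform fill used above:

# Toy example: fill a missing value with its own country's mean
toy = pd.DataFrame({"country": ["A", "A", "B"], "age_ratio": [10.0, None, 20.0]})
toy["age_ratio"] = toy["age_ratio"].fillna(toy.groupby("country")["age_ratio"].transform("mean"))
print(toy)  # the missing value for country A becomes 10.0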
Then, we rechecked the total number of null values to ensure the estimates above were filled in successfully.
Based on the result, there were 19 remaining NaN values, all in 'age_ratio'.
# determine the total number of Null data per column
cleaning_wip_df.isnull().sum()
Country name                         0
year                                 0
Life Ladder                          0
Log GDP per capita                   0
Social support                       0
Healthy life expectancy at birth     0
Freedom to make life choices         0
Generosity                           0
Perceptions of corruption            0
Positive affect                      0
Negative affect                      0
age_ratio                           19
dtype: int64
We determined that these 19 records belonged to Kosovo, Mali, Niger, the Palestinian Territories, and Taiwan Province of China.
cleaning_wip_df[cleaning_wip_df['age_ratio'].isnull()]
| | Country name | year | Life Ladder | Log GDP per capita | Social support | Healthy life expectancy at birth | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect | age_ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 905 | Kosovo | 2007 | 5.104 | 8.9280 | 0.848 | 60.911267 | 0.3810 | 0.144 | 0.894 | 0.6550 | 0.2370 | NaN |
| 906 | Kosovo | 2008 | 5.522 | 8.9810 | 0.884 | 61.118533 | 0.4435 | 0.090 | 0.849 | 0.6265 | 0.3180 | NaN |
| 907 | Kosovo | 2009 | 5.891 | 9.0080 | 0.830 | 61.325800 | 0.5060 | 0.201 | 0.968 | 0.5980 | 0.1690 | NaN |
| 908 | Kosovo | 2010 | 5.177 | 9.0330 | 0.708 | 61.533067 | 0.4510 | 0.170 | 0.967 | 0.6950 | 0.1180 | NaN |
| 909 | Kosovo | 2011 | 4.860 | 9.0670 | 0.759 | 61.740333 | 0.5890 | 0.004 | 0.919 | 0.6960 | 0.1240 | NaN |
| 910 | Kosovo | 2012 | 5.640 | 9.0860 | 0.757 | 61.947600 | 0.6360 | 0.027 | 0.950 | 0.5960 | 0.1000 | NaN |
| 911 | Kosovo | 2013 | 6.126 | 9.1130 | 0.721 | 62.154867 | 0.5680 | 0.115 | 0.935 | 0.6920 | 0.2030 | NaN |
| 912 | Kosovo | 2014 | 5.000 | 9.1290 | 0.706 | 62.362133 | 0.4410 | 0.012 | 0.775 | 0.6360 | 0.2060 | NaN |
| 913 | Kosovo | 2015 | 5.077 | 9.1820 | 0.805 | 62.569400 | 0.5610 | 0.181 | 0.851 | 0.7530 | 0.1800 | NaN |
| 914 | Kosovo | 2016 | 5.759 | 9.2280 | 0.824 | 62.776667 | 0.8270 | 0.125 | 0.941 | 0.7040 | 0.1500 | NaN |
| 915 | Kosovo | 2017 | 6.149 | 9.2620 | 0.792 | 62.983933 | 0.8580 | 0.117 | 0.925 | 0.7380 | 0.1860 | NaN |
| 916 | Kosovo | 2018 | 6.392 | 9.2960 | 0.822 | 63.191200 | 0.8900 | 0.269 | 0.922 | 0.7780 | 0.1700 | NaN |
| 917 | Kosovo | 2019 | 6.425 | 9.3390 | 0.843 | 63.398467 | 0.8410 | 0.247 | 0.920 | 0.7490 | 0.1410 | NaN |
| 918 | Kosovo | 2020 | 6.294 | 9.3285 | 0.792 | 63.605733 | 0.8800 | 0.252 | 0.910 | 0.7260 | 0.2010 | NaN |
| 1981 | Kosovo | 2021 | 6.372 | 9.3180 | 0.821 | 63.813000 | 0.8690 | 0.257 | 0.917 | 0.7220 | 0.2265 | NaN |
| 2065 | Mali | 2021 | 4.723 | 7.7440 | 0.724 | 51.969000 | 0.6970 | -0.036 | 0.827 | 0.7244 | 0.3352 | NaN |
| 2044 | Niger | 2021 | 5.074 | 7.0980 | 0.641 | 53.780000 | 0.8060 | 0.018 | 0.693 | 0.7990 | 0.2310 | NaN |
| 2073 | Palestinian Territories | 2021 | 4.517 | 8.4850 | 0.826 | 62.250000 | 0.6530 | -0.163 | 0.821 | 0.7570 | 0.2960 | NaN |
| 1972 | Taiwan Province of China | 2021 | 6.584 | 10.8710 | 0.898 | 69.600000 | 0.7840 | -0.070 | 0.721 | 0.5620 | 0.2100 | NaN |
We decided to remove these listings from our dataset.
# keep only rows with a valid age_ratio; NaN >= 1 evaluates to False, so NaN rows are dropped
cleaning_wip_df = cleaning_wip_df[cleaning_wip_df.age_ratio >= 1]
# determine the total number of Null data per column
cleaning_wip_df.isnull().sum()
Country name                        0
year                                0
Life Ladder                         0
Log GDP per capita                  0
Social support                      0
Healthy life expectancy at birth    0
Freedom to make life choices        0
Generosity                          0
Perceptions of corruption           0
Positive affect                     0
Negative affect                     0
age_ratio                           0
dtype: int64
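An equivalent and arguably more explicit way to drop those rows would be dropna on the column, sketched here but not run:

# Sketch: drop the rows whose 'age_ratio' is still NaN
cleaning_wip_df = cleaning_wip_df.dropna(subset=['age_ratio'])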
We used the describe() function to get a summary of our processed data.
There were 1696 rows of records left in the processed dataset, covering 150 unique countries.
Russia was the country with the highest frequency, with 16 records in total.
Our processed data ranged from year 2005 to year 2021.
# study the basic statistics of each attribute in the cleaned data set
# (note: the name 'stats' shadows the scipy 'stats' import from earlier)
stats = cleaning_wip_df.describe(include='all')
print(stats)
Country name year Life Ladder Log GDP per capita \
count 1696 1696.000000 1696.000000 1696.000000
unique 150 NaN NaN NaN
top Russia NaN NaN NaN
freq 16 NaN NaN NaN
mean NaN 2013.830189 5.443066 9.375139
std NaN 4.508602 0.985338 1.027889
min NaN 2005.000000 2.694000 6.678000
25% NaN 2010.000000 4.730000 8.564750
50% NaN 2014.000000 5.395500 9.473500
75% NaN 2018.000000 6.148500 10.211000
max NaN 2021.000000 7.632000 11.592000
Social support Healthy life expectancy at birth \
count 1696.000000 1696.000000
unique NaN NaN
top NaN NaN
freq NaN NaN
mean 0.819403 63.741887
std 0.101633 6.658627
min 0.511000 43.900000
25% 0.759750 59.840000
50% 0.838000 65.300000
75% 0.902000 68.100000
max 0.985000 75.200000
Freedom to make life choices Generosity Perceptions of corruption \
count 1696.000000 1696.000000 1696.000000
unique NaN NaN NaN
top NaN NaN NaN
freq NaN NaN NaN
mean 0.741432 -0.026302 0.786315
std 0.129162 0.143375 0.121323
min 0.344000 -0.335000 0.415000
25% 0.651000 -0.131000 0.720000
50% 0.759000 -0.047000 0.809000
75% 0.841000 0.063250 0.875000
max 0.985000 0.391000 0.983000
Positive affect Negative affect age_ratio
count 1696.000000 1696.000000 1696.000000
unique NaN NaN NaN
top NaN NaN NaN
freq NaN NaN NaN
mean 0.706951 0.271816 57.614414
std 0.101181 0.074195 15.641881
min 0.384000 0.095000 17.082104
25% 0.631000 0.219000 47.311080
50% 0.714000 0.267900 53.477715
75% 0.788000 0.321000 64.883070
max 0.944000 0.483000 98.315236
cleaning_wip_df
| Country name | year | Life Ladder | Log GDP per capita | Social support | Healthy life expectancy at birth | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect | age_ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | Afghanistan | 2012 | 3.783 | 7.705 | 0.521 | 52.240 | 0.531 | 0.236 | 0.776 | 0.710 | 0.268 | 97.925947 |
| 6 | Afghanistan | 2014 | 3.131 | 7.718 | 0.526 | 52.880 | 0.509 | 0.104 | 0.871 | 0.532 | 0.375 | 92.649143 |
| 7 | Afghanistan | 2015 | 3.983 | 7.702 | 0.529 | 53.200 | 0.389 | 0.080 | 0.881 | 0.554 | 0.339 | 89.954092 |
| 8 | Afghanistan | 2016 | 4.220 | 7.697 | 0.559 | 53.000 | 0.523 | 0.042 | 0.793 | 0.565 | 0.348 | 87.941788 |
| 12 | Albania | 2007 | 4.634 | 9.142 | 0.821 | 65.800 | 0.529 | -0.009 | 0.875 | 0.553 | 0.246 | 51.604342 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1945 | Zimbabwe | 2017 | 3.638 | 8.016 | 0.754 | 55.000 | 0.753 | -0.098 | 0.751 | 0.806 | 0.224 | 83.466245 |
| 1946 | Zimbabwe | 2018 | 3.616 | 8.049 | 0.775 | 55.600 | 0.763 | -0.068 | 0.844 | 0.710 | 0.212 | 82.951113 |
| 1947 | Zimbabwe | 2019 | 2.694 | 7.950 | 0.759 | 56.200 | 0.632 | -0.064 | 0.831 | 0.716 | 0.235 | 82.277964 |
| 1948 | Zimbabwe | 2020 | 3.160 | 7.829 | 0.717 | 56.800 | 0.643 | -0.009 | 0.789 | 0.703 | 0.346 | 81.571496 |
| 2096 | Zimbabwe | 2021 | 3.145 | 7.943 | 0.750 | 56.201 | 0.677 | -0.047 | 0.821 | 0.703 | 0.346 | 82.011114 |
1696 rows × 12 columns
TEST  # display the 356 outlier records that were set aside during cleaning
| Country name | year | Life Ladder | Log GDP per capita | Social support | Healthy life expectancy at birth | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect | age_ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 2008 | 3.724 | 7.370 | 0.451 | 50.800 | 0.718 | 0.168 | 0.882 | 0.518 | 0.258 | 102.078659 |
| 11 | Afghanistan | 2019 | 2.375 | 7.697 | 0.420 | 52.400 | 0.394 | -0.108 | 0.924 | 0.351 | 0.502 | 82.109772 |
| 10 | Afghanistan | 2018 | 2.694 | 7.692 | 0.508 | 52.600 | 0.374 | -0.094 | 0.928 | 0.424 | 0.405 | 84.077655 |
| 9 | Afghanistan | 2017 | 2.662 | 7.697 | 0.491 | 52.800 | 0.427 | -0.121 | 0.954 | 0.496 | 0.371 | 86.000755 |
| 2097 | Afghanistan | 2021 | 2.523 | 7.695 | 0.463 | 52.493 | 0.382 | -0.102 | 0.924 | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1833 | United Kingdom | 2018 | 7.233 | 10.743 | 0.928 | 72.300 | 0.838 | 0.226 | 0.404 | 0.783 | 0.228 | 56.430746 |
| 1821 | United Kingdom | 2005 | 6.984 | 10.663 | 0.979 | 69.900 | 0.922 | NaN | 0.398 | 0.864 | 0.262 | 51.545791 |
| 1825 | United Kingdom | 2010 | 7.029 | 10.649 | 0.955 | 71.300 | 0.841 | 0.403 | 0.587 | 0.863 | 0.176 | 51.665045 |
| 1935 | Zimbabwe | 2007 | 3.280 | 7.666 | 0.828 | 42.860 | 0.456 | -0.082 | 0.946 | 0.661 | 0.265 | 79.675617 |
| 1934 | Zimbabwe | 2006 | 3.826 | 7.711 | 0.822 | 41.580 | 0.431 | -0.076 | 0.905 | 0.715 | 0.297 | 79.694613 |
356 rows × 12 columns
Due to the abnormal (non-normal) distribution of our pre-processed data set, we decided to add the outliers back in for our further analysis in Part 2 of the assignment.
result = cleaning_wip_df.append([TEST])
result
| Country name | year | Life Ladder | Log GDP per capita | Social support | Healthy life expectancy at birth | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect | age_ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | Afghanistan | 2012 | 3.783 | 7.705 | 0.521 | 52.24 | 0.531 | 0.236 | 0.776 | 0.710 | 0.268 | 97.925947 |
| 6 | Afghanistan | 2014 | 3.131 | 7.718 | 0.526 | 52.88 | 0.509 | 0.104 | 0.871 | 0.532 | 0.375 | 92.649143 |
| 7 | Afghanistan | 2015 | 3.983 | 7.702 | 0.529 | 53.20 | 0.389 | 0.080 | 0.881 | 0.554 | 0.339 | 89.954092 |
| 8 | Afghanistan | 2016 | 4.220 | 7.697 | 0.559 | 53.00 | 0.523 | 0.042 | 0.793 | 0.565 | 0.348 | 87.941788 |
| 12 | Albania | 2007 | 4.634 | 9.142 | 0.821 | 65.80 | 0.529 | -0.009 | 0.875 | 0.553 | 0.246 | 51.604342 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1833 | United Kingdom | 2018 | 7.233 | 10.743 | 0.928 | 72.30 | 0.838 | 0.226 | 0.404 | 0.783 | 0.228 | 56.430746 |
| 1821 | United Kingdom | 2005 | 6.984 | 10.663 | 0.979 | 69.90 | 0.922 | NaN | 0.398 | 0.864 | 0.262 | 51.545791 |
| 1825 | United Kingdom | 2010 | 7.029 | 10.649 | 0.955 | 71.30 | 0.841 | 0.403 | 0.587 | 0.863 | 0.176 | 51.665045 |
| 1935 | Zimbabwe | 2007 | 3.280 | 7.666 | 0.828 | 42.86 | 0.456 | -0.082 | 0.946 | 0.661 | 0.265 | 79.675617 |
| 1934 | Zimbabwe | 2006 | 3.826 | 7.711 | 0.822 | 41.58 | 0.431 | -0.076 | 0.905 | 0.715 | 0.297 | 79.694613 |
2052 rows × 12 columns
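Note: DataFrame.append is deprecated in newer versions of pandas (and removed in pandas 2.0); under that assumption, an equivalent would be:
# concatenate the cleaned rows and the outlier rows back together
result = pd.concat([cleaning_wip_df, TEST])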
We renamed the 'age_ratio' column to 'Age ratio' to match the naming style of the other columns.
result = result.rename(columns = {'age_ratio' : 'Age ratio'})
result
| Country name | year | Life Ladder | Log GDP per capita | Social support | Healthy life expectancy at birth | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect | Age ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | Afghanistan | 2012 | 3.783 | 7.705 | 0.521 | 52.24 | 0.531 | 0.236 | 0.776 | 0.710 | 0.268 | 97.925947 |
| 6 | Afghanistan | 2014 | 3.131 | 7.718 | 0.526 | 52.88 | 0.509 | 0.104 | 0.871 | 0.532 | 0.375 | 92.649143 |
| 7 | Afghanistan | 2015 | 3.983 | 7.702 | 0.529 | 53.20 | 0.389 | 0.080 | 0.881 | 0.554 | 0.339 | 89.954092 |
| 8 | Afghanistan | 2016 | 4.220 | 7.697 | 0.559 | 53.00 | 0.523 | 0.042 | 0.793 | 0.565 | 0.348 | 87.941788 |
| 12 | Albania | 2007 | 4.634 | 9.142 | 0.821 | 65.80 | 0.529 | -0.009 | 0.875 | 0.553 | 0.246 | 51.604342 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1833 | United Kingdom | 2018 | 7.233 | 10.743 | 0.928 | 72.30 | 0.838 | 0.226 | 0.404 | 0.783 | 0.228 | 56.430746 |
| 1821 | United Kingdom | 2005 | 6.984 | 10.663 | 0.979 | 69.90 | 0.922 | NaN | 0.398 | 0.864 | 0.262 | 51.545791 |
| 1825 | United Kingdom | 2010 | 7.029 | 10.649 | 0.955 | 71.30 | 0.841 | 0.403 | 0.587 | 0.863 | 0.176 | 51.665045 |
| 1935 | Zimbabwe | 2007 | 3.280 | 7.666 | 0.828 | 42.86 | 0.456 | -0.082 | 0.946 | 0.661 | 0.265 | 79.675617 |
| 1934 | Zimbabwe | 2006 | 3.826 | 7.711 | 0.822 | 41.58 | 0.431 | -0.076 | 0.905 | 0.715 | 0.297 | 79.694613 |
2052 rows × 12 columns
result.head(10)
| Country name | year | Life Ladder | Log GDP per capita | Social support | Healthy life expectancy at birth | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect | Age ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | Afghanistan | 2012 | 3.783 | 7.705 | 0.521 | 52.24 | 0.531 | 0.236 | 0.776 | 0.710 | 0.268 | 97.925947 |
| 6 | Afghanistan | 2014 | 3.131 | 7.718 | 0.526 | 52.88 | 0.509 | 0.104 | 0.871 | 0.532 | 0.375 | 92.649143 |
| 7 | Afghanistan | 2015 | 3.983 | 7.702 | 0.529 | 53.20 | 0.389 | 0.080 | 0.881 | 0.554 | 0.339 | 89.954092 |
| 8 | Afghanistan | 2016 | 4.220 | 7.697 | 0.559 | 53.00 | 0.523 | 0.042 | 0.793 | 0.565 | 0.348 | 87.941788 |
| 12 | Albania | 2007 | 4.634 | 9.142 | 0.821 | 65.80 | 0.529 | -0.009 | 0.875 | 0.553 | 0.246 | 51.604342 |
| 13 | Albania | 2009 | 5.485 | 9.262 | 0.833 | 66.20 | 0.525 | -0.158 | 0.864 | 0.640 | 0.279 | 50.044078 |
| 14 | Albania | 2010 | 5.269 | 9.303 | 0.733 | 66.40 | 0.569 | -0.172 | 0.726 | 0.648 | 0.300 | 49.477909 |
| 15 | Albania | 2011 | 5.867 | 9.331 | 0.759 | 66.68 | 0.487 | -0.205 | 0.877 | 0.628 | 0.257 | 48.118058 |
| 16 | Albania | 2012 | 5.510 | 9.347 | 0.785 | 66.96 | 0.602 | -0.169 | 0.848 | 0.607 | 0.271 | 47.033080 |
| 17 | Albania | 2013 | 4.551 | 9.359 | 0.759 | 67.24 | 0.632 | -0.127 | 0.863 | 0.634 | 0.338 | 46.256656 |
result.isnull().sum()
Country name 0 year 0 Life Ladder 0 Log GDP per capita 12 Social support 3 Healthy life expectancy at birth 12 Freedom to make life choices 4 Generosity 19 Perceptions of corruption 9 Positive affect 22 Negative affect 21 Age ratio 19 dtype: int64
# fill remaining missing indicator values by linear interpolation
result['Log GDP per capita'] = result['Log GDP per capita'].interpolate(method='linear')
result['Social support'] = result['Social support'].interpolate(method='linear')
result['Healthy life expectancy at birth'] = result['Healthy life expectancy at birth'].interpolate(method='linear')
result['Freedom to make life choices'] = result['Freedom to make life choices'].interpolate(method='linear')
result['Generosity'] = result['Generosity'].interpolate(method='linear')
result['Perceptions of corruption'] = result['Perceptions of corruption'].interpolate(method='linear')
result['Positive affect'] = result['Positive affect'].interpolate(method='linear')
result['Negative affect'] = result['Negative affect'].interpolate(method='linear')
result['Age ratio'] = result['Age ratio'].fillna(result.groupby('Country name')['Age ratio'].transform('mean'))
result.isnull().sum()
Country name 0 year 0 Life Ladder 0 Log GDP per capita 0 Social support 0 Healthy life expectancy at birth 0 Freedom to make life choices 0 Generosity 0 Perceptions of corruption 0 Positive affect 0 Negative affect 0 Age ratio 0 dtype: int64
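As a quick illustration of the group-mean imputation used for 'Age ratio' above, consider a toy dataframe (hypothetical values, not project data):
# toy example: the missing value for country 'A' is filled with 40.0,
# the mean of A's known Age ratio values
demo = pd.DataFrame({'Country name': ['A', 'A', 'A', 'B'], 'Age ratio': [38.0, 42.0, None, 60.0]})
demo['Age ratio'] = demo['Age ratio'].fillna(demo.groupby('Country name')['Age ratio'].transform('mean'))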
We shall save this dataset as clean_data.csv for Part 2.
result.to_csv("clean_data.csv",index=False)
The Project Seminar is the continuation of Part 1 which contains the following sections:-
Import additional packages required for EDA.
import matplotlib.pyplot as plt
import seaborn as sns
Read the pre-processed dataset and assign to 'happiness'.
happiness = pd.read_csv('clean_data.csv')
happiness
| Country name | year | Life Ladder | Log GDP per capita | Social support | Healthy life expectancy at birth | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect | Age ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 2012 | 3.783 | 7.705 | 0.521 | 52.24 | 0.531 | 0.2360 | 0.776 | 0.710 | 0.268 | 97.925947 |
| 1 | Afghanistan | 2014 | 3.131 | 7.718 | 0.526 | 52.88 | 0.509 | 0.1040 | 0.871 | 0.532 | 0.375 | 92.649143 |
| 2 | Afghanistan | 2015 | 3.983 | 7.702 | 0.529 | 53.20 | 0.389 | 0.0800 | 0.881 | 0.554 | 0.339 | 89.954092 |
| 3 | Afghanistan | 2016 | 4.220 | 7.697 | 0.559 | 53.00 | 0.523 | 0.0420 | 0.793 | 0.565 | 0.348 | 87.941788 |
| 4 | Albania | 2007 | 4.634 | 9.142 | 0.821 | 65.80 | 0.529 | -0.0090 | 0.875 | 0.553 | 0.246 | 51.604342 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2047 | United Kingdom | 2018 | 7.233 | 10.743 | 0.928 | 72.30 | 0.838 | 0.2260 | 0.404 | 0.783 | 0.228 | 56.430746 |
| 2048 | United Kingdom | 2005 | 6.984 | 10.663 | 0.979 | 69.90 | 0.922 | 0.3145 | 0.398 | 0.864 | 0.262 | 51.545791 |
| 2049 | United Kingdom | 2010 | 7.029 | 10.649 | 0.955 | 71.30 | 0.841 | 0.4030 | 0.587 | 0.863 | 0.176 | 51.665045 |
| 2050 | Zimbabwe | 2007 | 3.280 | 7.666 | 0.828 | 42.86 | 0.456 | -0.0820 | 0.946 | 0.661 | 0.265 | 79.675617 |
| 2051 | Zimbabwe | 2006 | 3.826 | 7.711 | 0.822 | 41.58 | 0.431 | -0.0760 | 0.905 | 0.715 | 0.297 | 79.694613 |
2052 rows × 12 columns
EDA analyses data sets using statistical graphics and other data-visualisation methods to communicate meaning and surface knowledge hidden in the data. Generally, EDA encompasses descriptive statistics and visualisation.
Our chosen methods for EDA are shown below:
Univariate analysis: Summary statistics and Histogram
Bivariate analysis: Correlation matrix and Pair plot
Multivariate analysis: Relplot
In this section, we used the Sweetviz library to conduct univariate analysis. Univariate analysis involves one variable at a time and is the simplest form of analysis.
First, we installed the Sweetviz library and ran it on our 'happiness' dataset. The library auto-generates the report in HTML format and saves it in our working directory.
!pip install sweetviz
import sweetviz as sv
analyze_report = sv.analyze(happiness)
analyze_report.show_html('analyze.html', open_browser=False)
Report analyze.html was generated.
The report 'analyze.html' is stored in our working directory by default. We used IPython's IFrame to embed the report here for a clearer overview.
from IPython.display import IFrame
IFrame(src="analyze.html", width=900, height=600)
From 'analyze.html', we could see that the pre-processed dataset contained 2,052 rows and 12 columns, including 9 independent variables and 1 target variable. The histograms showed that these variables were skewed rather than evenly distributed. Hence, we might need to transform our variables before the modelling phase.
Bivariate analysis examines the relationship between two variables. Here, we calculated the correlation between different variables and plotted heatmap using Seaborn library.
# Correlation between different variables
corr = happiness.corr()
# Set up the matplotlib plot configuration
f, ax = plt.subplots(figsize=(12, 8))
# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))
# Configure a custom diverging colormap (defined but not used below; the heatmap call uses 'coolwarm')
cmap = sns.diverging_palette(200, 15, as_cmap=True)
# Configure text size
plt.rc('axes', labelsize=12) # fontsize of the x and y labels
plt.rc('xtick', labelsize=12) # fontsize of the tick labels
plt.rc('ytick', labelsize=12) # fontsize of the tick labels
plt.rc('font', size=12) # controls default text sizes
# Draw the heatmap
sns.heatmap(corr, annot=True, mask = mask, cmap='coolwarm')
<AxesSubplot:>
Our metrics for correlation strength were based on the absolute value of the coefficient, |r|:
0.75 ≤ |r| ≤ 1.00 -> strong
0.25 ≤ |r| < 0.75 -> intermediate
0.01 ≤ |r| < 0.25 -> weak
The variable 'year' was to be ignored.
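These thresholds are easy to apply programmatically; a minimal sketch (the helper name correlation_strength is ours):
def correlation_strength(r):
    # classify a correlation coefficient by its absolute value
    r = abs(r)
    if r >= 0.75:
        return 'strong'
    elif r >= 0.25:
        return 'intermediate'
    elif r >= 0.01:
        return 'weak'
    return 'negligible'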
The heatmap above yielded several interesting findings relating to our dependent variable, Healthy life expectancy at birth. Diving deeper, we explored the variables that possessed a strong or intermediate correlation, positive or negative, with Healthy life expectancy at birth:
1. Log GDP per Capita (0.83)
Strongly and positively correlated with Life ladder (0.78) and Social support (0.68). Strongly and negatively correlated with Age ratio (-0.74), which conforms with our usual expectation.
2. Life Ladder (0.74)
Intermediately and positively correlated with Freedom to make life choices (0.53) and Positive affect (0.52). Interestingly, none of these variables, Social support included, was as positively associated with life expectancy as Life Ladder was, with Log GDP per capita being the exception.
3. Age Ratio (-0.74)
Strongly and negatively correlated with Log GDP per capita.
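The coefficients quoted above can be read straight off the correlation matrix computed earlier; for instance (assuming pandas 1.1+ for the key argument):
# correlations with the dependent variable, strongest first (ignoring 'year')
corr['Healthy life expectancy at birth'].drop('year').sort_values(key=abs, ascending=False)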
Using a pair plot, we produced a descriptive visualisation similar to the heatmap, but in the form of scatter plots.
sns.pairplot(happiness)
<seaborn.axisgrid.PairGrid at 0x1803902ea30>
From the scatter plots above, we could once again confirm that 'year' bears no relation to the other variables; it only serves as a time index in the dataset.
Focusing on the 5th row, where Healthy life expectancy at birth acts as the dependent variable, we observed a few interesting patterns.
Next, we went into detail by plotting two variables coloured by a third (the hue), guided by the findings from the correlation analysis.
sns.relplot(x='Log GDP per capita', y='Healthy life expectancy at birth', hue = 'Life Ladder', data = happiness)
<seaborn.axisgrid.FacetGrid at 0x18025b10ca0>
sns.relplot(x='Log GDP per capita', y='Healthy life expectancy at birth', hue = 'Social support', data = happiness)
<seaborn.axisgrid.FacetGrid at 0x1804574b6d0>
sns.relplot(x='Log GDP per capita', y='Healthy life expectancy at birth', hue = 'Age ratio', data = happiness)
<seaborn.axisgrid.FacetGrid at 0x18039598e80>
sns.relplot(x='Life Ladder', y='Healthy life expectancy at birth', hue = 'Log GDP per capita', data = happiness)
<seaborn.axisgrid.FacetGrid at 0x18038ea52e0>
sns.relplot(x='Life Ladder', y='Healthy life expectancy at birth', hue = 'Social support', data = happiness)
<seaborn.axisgrid.FacetGrid at 0x180395a3100>
sns.relplot(x='Life Ladder', y='Healthy life expectancy at birth', hue = 'Freedom to make life choices', data = happiness)
<seaborn.axisgrid.FacetGrid at 0x1803859f5e0>
sns.relplot(x='Life Ladder', y='Healthy life expectancy at birth', hue = 'Positive affect', data = happiness)
<seaborn.axisgrid.FacetGrid at 0x18037faaac0>
sns.relplot(x='Life Ladder', y='Healthy life expectancy at birth', hue = 'Age ratio', data = happiness)
<seaborn.axisgrid.FacetGrid at 0x180386e5d00>
sns.relplot(x='Age ratio', y='Healthy life expectancy at birth', hue = 'Log GDP per capita', data = happiness)
<seaborn.axisgrid.FacetGrid at 0x1803941de20>
We observed a turning point in Healthy life expectancy at birth when Age ratio passes roughly 40: as Age ratio increases towards this optimum, Healthy life expectancy at birth increases; beyond it, Healthy life expectancy at birth decreases, and so do Log GDP per capita and Life Ladder.
Since Age ratio is the 'ratio of dependent population to the working population which indicates financial stress level', a moderate level of financial stress may motivate a person to work harder (Kilby and Sherman, 2016), supporting economic growth (Log GDP per capita) and happiness (Life Ladder), and thereby raising life expectancy.
However, when the financial stress becomes overwhelming, the person may suffer physical, emotional and mental distress and lose motivation to work, so economic growth, happiness and life expectancy suffer in turn.
We built a Support Vector Machine for regression.
We first split the data into 2 parts:
The columns of data used to make predictions - X
The column of data we wanted to predict - y
We dropped Healthy life expectancy at birth, Country name and year from our X variable because:
Healthy life expectancy at birth is our dependent variable;
Country name and year are categorical data, and since Support Vector Machines do not natively support categorical data, including them would confuse the algorithm (Josh Starmer, StatQuest).
X = happiness.drop(['Healthy life expectancy at birth','Country name','year'], axis = 1).copy()
X.head()
| Life Ladder | Log GDP per capita | Social support | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect | Age ratio | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.783 | 7.705 | 0.521 | 0.531 | 0.236 | 0.776 | 0.710 | 0.268 | 97.925947 |
| 1 | 3.131 | 7.718 | 0.526 | 0.509 | 0.104 | 0.871 | 0.532 | 0.375 | 92.649143 |
| 2 | 3.983 | 7.702 | 0.529 | 0.389 | 0.080 | 0.881 | 0.554 | 0.339 | 89.954092 |
| 3 | 4.220 | 7.697 | 0.559 | 0.523 | 0.042 | 0.793 | 0.565 | 0.348 | 87.941788 |
| 4 | 4.634 | 9.142 | 0.821 | 0.529 | -0.009 | 0.875 | 0.553 | 0.246 | 51.604342 |
Next, we created y that only contained the healthy life expectancy at birth column.
y = happiness['Healthy life expectancy at birth'].copy()
y.head()
0 52.24 1 52.88 2 53.20 3 53.00 4 65.80 Name: Healthy life expectancy at birth, dtype: float64
Support Vector Machines natively support continuous data but not categorical data; thus, One-Hot Encoding needs to be performed to transform a categorical data column into multiple binary data columns.
Since all of our variables were continuous data, this step was skipped (see the illustration below).
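For completeness, here is a toy illustration of what One-Hot Encoding would look like for a hypothetical categorical column, using pandas' get_dummies (the 'Region' column is made up for this example):
# hypothetical 'Region' column expanded into binary Region_Asia / Region_Europe columns
toy = pd.DataFrame({'Region': ['Asia', 'Europe', 'Asia']})
pd.get_dummies(toy, columns=['Region'])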
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
We proceeded with the prediction using Support Vector Machine.
from sklearn.svm import SVR
# Fitting SVM regression to the Training set
SVM_regression = SVR()
SVM_regression.fit(X_train, y_train)
SVR()
# Predicting the Test set results
y_pred = SVM_regression.predict(X_test)
predictions = pd.DataFrame({ 'y_test':y_test, 'y_pred':y_pred})
predictions.head()
| y_test | y_pred | |
|---|---|---|
| 474 | 64.60 | 64.575570 |
| 1397 | 73.90 | 68.185271 |
| 193 | 55.24 | 63.341023 |
| 1724 | 54.70 | 55.440579 |
| 536 | 65.04 | 66.091000 |
Model evaluation using plot:
sns.scatterplot(x=y_test, y=y_pred, alpha=0.6)
sns.lineplot(x=y_test, y=y_test)  # keyword args avoid the seaborn FutureWarning
plt.xlabel('Actual count')
plt.ylabel('Predicted count')
plt.title('Actual vs Predicted count (test set)')
plt.show()
Model evaluation using numerical scores:
from sklearn.metrics import mean_squared_error, r2_score,mean_absolute_error
print("Mean squared error (linear model): {:.2f}".format(mean_squared_error(y_test, y_pred)))
print("Mean ab error (linear model): {:.2f}".format(mean_absolute_error(y_test, y_pred)))
print("r2_score (linear model): {:.2f}".format(r2_score(y_test, y_pred)))
Mean squared error (linear model): 20.14 Mean ab error (linear model): 3.33 r2_score (linear model): 0.63
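One caveat worth noting: SVR's default RBF kernel is sensitive to feature scale, and the features here were left unscaled (Age ratio alone spans roughly 17 to 98). A sketch of a scaled variant, which may well score differently, could look like this:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
# standardise the features before fitting the SVR
scaled_svm = make_pipeline(StandardScaler(), SVR())
scaled_svm.fit(X_train, y_train)
print(scaled_svm.score(X_test, y_test))  # R-squared on the test set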
KNN regression is a non-parametric regressor. It estimates the association between the independent variables and the (continuous) target variable by averaging the targets of the entities in the same neighbourhood.
Similar to Model 1 (SVM), we formatted the data as follows:
Step 1: Split the data into independent and dependent variables;
Step 2: Split the data into training and test set;
Step 3: Transform the training and test set through scaling;
Step 4: Conduct fit and predict of modelling phase;
Step 5: Evaluation.
#KNN Regressor
df = happiness.copy()
We removed the 'Country name' and 'year' columns to meet the data requirements of the KNN regressor - only continuous variables are suitable for effective modelling.
df.drop(['Country name', 'year'], axis=1, inplace=True)
We split the dataset into training and test data such that training data = 70% whereas test data = 30%.
Since 'Healthy life expectancy at birth' is our target variable, we dropped it from x_train and assigned it to y_train.
from sklearn.model_selection import train_test_split
train , test = train_test_split(df, test_size = 0.3)
x_train = train.drop('Healthy life expectancy at birth', axis=1)
y_train = train['Healthy life expectancy at birth']
x_test = test.drop('Healthy life expectancy at birth', axis = 1)
y_test = test['Healthy life expectancy at birth']
Scaling is important when dealing with distance-based algorithms such as KNN, K-means and SVM, as these algorithms are sensitive to the range of the data points. Hence, we utilised MinMaxScaler to transform the independent variables - x_train and x_test.
We did not need to scale y_train or y_test, as the model sets its parameter values based on the transformed x_train and x_test.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))
# fit the scaler on the training data only, then apply the same
# transformation to the test data to avoid information leakage
x_train_scaled = scaler.fit_transform(x_train)
x_train = pd.DataFrame(x_train_scaled)
x_test_scaled = scaler.transform(x_test)
x_test = pd.DataFrame(x_test_scaled)
After preparing our training and test data according to the data requirements of the KNN regressor, we imported the necessary packages.
#import required packages
from sklearn import neighbors
from sklearn.metrics import mean_squared_error
from math import sqrt
import matplotlib.pyplot as plt
%matplotlib inline
We built a for loop, fit the model on our training data, made prediction on the test data, and calculated the RMSE values.
rmse_val = [] #to store rmse values for different k
for K in range(1, 21):
    model = neighbors.KNeighborsRegressor(n_neighbors = K)
    model.fit(x_train, y_train)  #fit the model
    pred = model.predict(x_test)  #make prediction on test set
    error = sqrt(mean_squared_error(y_test, pred))  #calculate rmse
    rmse_val.append(error)  #store rmse values
    print('RMSE value for k= ', K, 'is:', error)
RMSE value for k= 1 is: 2.6838606834499745 RMSE value for k= 2 is: 2.3924566427496763 RMSE value for k= 3 is: 2.4027115799205503 RMSE value for k= 4 is: 2.512434604584795 RMSE value for k= 5 is: 2.5084109611191785 RMSE value for k= 6 is: 2.594222026032898 RMSE value for k= 7 is: 2.6628022747526665 RMSE value for k= 8 is: 2.7479866404352284 RMSE value for k= 9 is: 2.7780282033621417 RMSE value for k= 10 is: 2.780622125220788 RMSE value for k= 11 is: 2.8263129094446047 RMSE value for k= 12 is: 2.856679300831943 RMSE value for k= 13 is: 2.8708992910654594 RMSE value for k= 14 is: 2.9262581203390057 RMSE value for k= 15 is: 2.9713386176188235 RMSE value for k= 16 is: 3.0009166327483907 RMSE value for k= 17 is: 3.022710504481161 RMSE value for k= 18 is: 3.0524321343378693 RMSE value for k= 19 is: 3.072587524660396 RMSE value for k= 20 is: 3.103762869359106
We plotted an elbow curve to find out which K has the lowest RMSE score.
#plotting the rmse values against k values
curve = pd.DataFrame(rmse_val) #elbow curve
curve.plot()
<AxesSubplot:>
Since the elbow curve only gave us an approximate value of K with the lowest RMSE score (the minimum lies between x-axis positions 0.0 and 2.5, i.e. roughly K = 1 to 3), we used GridSearchCV to capture the best parameter value from the given set of candidates.
from sklearn.model_selection import GridSearchCV
params = {'n_neighbors':[2,3,4,5,6,7]}
knn = neighbors.KNeighborsRegressor()
model = GridSearchCV(knn, params, cv=5)
model.fit(x_train,y_train)
model.best_params_
{'n_neighbors': 2}
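GridSearchCV also records the mean cross-validated score of the winning setting, which can be inspected alongside best_params_:
# mean cross-validated score (R-squared by default for regressors) of the best K
print(model.best_score_)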
To see whether the model works, we improvised 9 feature values to predict 'Healthy life expectancy at birth', which came to 69.76 years. (Note that the model was trained on MinMax-scaled features, so improvised values outside the 0-1 range lie outside the training domain.)
print(model.predict([[9,1,1,1,1,1,1,1,1]]))
[69.76]
Out of curiosity, we plotted scatter plots for x_train and x_test with K = 2. The four scatter plots below portray how x_train, the x_test samples and the neighbouring data points interact with each other.
def get_neighbors(data, sample, k=2):
    # rank all training points by L1 distance to the sample and keep the k nearest
    neighbors = [(x, np.sum(np.abs(x - sample))) for x in data]
    neighbors = sorted(neighbors, key=lambda x: x[1])
    return np.array([x for x, _ in neighbors[:k]])
_, ax = plt.subplots(nrows=1, ncols=4, figsize=(15, 5))
for i in range(4):
    # use the scaled NumPy arrays; positional slicing on the DataFrames would fail
    sample = x_test_scaled[i]
    neighbors = get_neighbors(x_train_scaled, sample, k=2)
    # plot only the first two (scaled) features for a 2-D view
    ax[i].scatter(x_train_scaled[:, 0], x_train_scaled[:, 1], c="skyblue")
    ax[i].scatter(neighbors[:, 0], neighbors[:, 1], edgecolor="green")
    ax[i].scatter(sample[0], sample[1], marker="+", c="red", s=100)
    ax[i].set(xlim=(-2, 2), ylim=(-2, 2))
plt.tight_layout()
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
img = mpimg.imread('RosieKNN.png')
imgplot = plt.imshow(img)
plt.show()
print(img.size)
1509376
To align with the evaluation process of the 4 other models, we ran another simple KNN Regressor pipeline to obtain the Mean Absolute Error, Mean Squared Error and R-squared scores.
from sklearn.neighbors import KNeighborsRegressor
regressor = KNeighborsRegressor(n_neighbors=4)  # the n value can be edited; GridSearchCV above suggested 2
regressor.fit(x_train, y_train)
KNeighborsRegressor(n_neighbors=4)
y_pred = regressor.predict(x_test) # test the output by changing values
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
print("Mean ab error (linear model): {:.2f}".format(mean_absolute_error(y_test, y_pred)))
print("Mean squared error (linear model): {:.2f}".format(mean_squared_error(y_test, y_pred)))
print("r2_score (linear model): {:.2f}".format(r2_score(y_test, y_pred)))
Mean ab error (linear model): 1.56 Mean squared error (linear model): 6.31 r2_score (linear model): 0.89
#save the model
import pickle
filename = 'finalized_knn.sav'
pickle.dump(regressor, open(filename, 'wb'))
Multiple Linear Regression (MLR) is the most common form of linear regression analysis. As a predictive analysis, MLR is used to explain the relationship between one continuous dependent variable and two or more independent variables.
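In our setting, the fitted model takes the familiar form y = b0 + b1*x1 + b2*x2 + ... + b9*x9 + e, where y is Healthy life expectancy at birth, x1 to x9 are the nine remaining indicators, b0 to b9 are the fitted coefficients and e is the error term.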
Steps to predict the life expectancy using MLR:
# read variables in our data
import pandas as pd
happiness = pd.read_csv('clean_data.csv')
print(list(happiness))
['Country name', 'year', 'Life Ladder', 'Log GDP per capita', 'Social support', 'Healthy life expectancy at birth', 'Freedom to make life choices', 'Generosity', 'Perceptions of corruption', 'Positive affect', 'Negative affect', 'Age ratio']
In order to perform predictions on life expectancy, our dependent variable is Healthy life expectancy at birth. Meanwhile, we removed the Country name and year columns for better modelling results; the remaining variables were our independent variables.
We then split the data into training and testing sets with a ratio of 4:1.
df = happiness.copy()
df.drop(['Country name', 'year'], axis=1, inplace=True)
from sklearn.model_selection import train_test_split
train , test = train_test_split(df, test_size = 0.2)
X_train = train.drop('Healthy life expectancy at birth', axis=1)
y_train = train['Healthy life expectancy at birth']
X_test = test.drop('Healthy life expectancy at birth', axis = 1)
y_test = test['Healthy life expectancy at birth']
To standardise our attribute values, we scaled the data.
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
# Import module:
from sklearn.linear_model import LinearRegression
LR = LinearRegression()
# Fitting the training data
LR.fit(X_train,y_train)
# Predicting the result
y_prediction = LR.predict(X_test)
import seaborn as sns
sns.scatterplot(x=y_test, y=y_prediction, alpha=0.3)
sns.lineplot(x=y_test, y=y_test)  # keyword args avoid the seaborn FutureWarning
import matplotlib.pyplot as plt
plt.xlabel('Actual count')
plt.ylabel('Predicted count')
plt.title('Actual vs Predicted count (test set)')
plt.show()
import warnings
warnings.filterwarnings('ignore')  # suppress warning output in subsequent cells
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
print("Mean absolute error (linear model): {:.2f}".format(mean_absolute_error(y_test, y_prediction)))
print("Mean squared error (linear model): {:.2f}".format(mean_squared_error(y_test, y_prediction)))
print("r2_score (linear model): {:.2f}".format(r2_score(y_test, y_prediction)))
Mean absolute error (linear model): 2.53 Mean squared error (linear model): 12.46 r2_score (linear model): 0.76
Random forest is a supervised machine learning method applicable to both classification and regression problems.
Our random forest was made up of multiple decision trees, and it combined their outputs to reach a single result.
Step 1: The first 5 rows of the dataset were selected and displayed to help us understand the current state of the dataset.
Step 2: The dataset for training was prepared by shortlisting the required features.
Step 3: The dataset was split into independent and dependent variables.
Step 4: The dataset was split into training and testing sets.
Step 5: The training and test sets underwent data transformation through scaling.
Step 6: Fit and predict were conducted during the modelling phase.
Step 7: The parameters of the random forest were refined.
Step 8: The results were displayed for evaluation.
happiness.head()
| Country name | year | Life Ladder | Log GDP per capita | Social support | Healthy life expectancy at birth | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect | Age ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 2012 | 3.783 | 7.705 | 0.521 | 52.24 | 0.531 | 0.236 | 0.776 | 0.710 | 0.268 | 97.925947 |
| 1 | Afghanistan | 2014 | 3.131 | 7.718 | 0.526 | 52.88 | 0.509 | 0.104 | 0.871 | 0.532 | 0.375 | 92.649143 |
| 2 | Afghanistan | 2015 | 3.983 | 7.702 | 0.529 | 53.20 | 0.389 | 0.080 | 0.881 | 0.554 | 0.339 | 89.954092 |
| 3 | Afghanistan | 2016 | 4.220 | 7.697 | 0.559 | 53.00 | 0.523 | 0.042 | 0.793 | 0.565 | 0.348 | 87.941788 |
| 4 | Albania | 2007 | 4.634 | 9.142 | 0.821 | 65.80 | 0.529 | -0.009 | 0.875 | 0.553 | 0.246 | 51.604342 |
We prepared the happiness dataset with the required features for training.
df = happiness.copy()
df.drop(['Country name', 'year'], axis=1, inplace=True)
from sklearn.model_selection import train_test_split
train , test = train_test_split(df, test_size = 0.2)
X_train = train.drop('Healthy life expectancy at birth', axis=1)
y_train = train['Healthy life expectancy at birth']
X_test = test.drop('Healthy life expectancy at birth', axis = 1)
y_test = test['Healthy life expectancy at birth']
Then, the dataset was divided into training and testing sets with an 80:20 allocation.
print('Training Features Shape:', X_train.shape)
print('Testing Features Shape:', X_test.shape)
print('Training Labels Shape:', y_train.shape)
print('Testing Labels Shape:', y_test.shape)
Training Features Shape: (1641, 9) Testing Features Shape: (411, 9) Training Labels Shape: (1641,) Testing Labels Shape: (411,)
Not all attributes in our dataset were on the same scale; Age ratio, for instance, ranged into the tens, while other attributes had values in the range of ones. Thus, we used Scikit-Learn's StandardScaler to scale our data.
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
Now that the dataset was scaled, the random forest model was trained to solve our regression problem.
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(bootstrap = False,
max_depth = 100,
max_features = 'sqrt',
min_samples_leaf = 1,
min_samples_split = 2,
n_estimators = 88).fit(X_train, y_train)
y_pred = regressor.predict(X_test)
Metrics used to evaluate the algorithm were:
(i) mean absolute error --> a measure of model error less sensitive to outliers than RMSE and MSE
(ii) root mean squared error --> the average error made by the model in predicting the outcome for an observation
(iii) mean squared error --> the average squared difference between the observed actual outcome values and the values predicted by the model
(iv) r2_score (linear model) --> an indication of how much of the variation is explained by the independent variables
The evaluated performance of the first random forest attempt is summarised below:
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
print('Mean Absolute Error:', mean_absolute_error(y_test, y_pred).round(2))
print('Root Mean Squared Error:', np.sqrt(mean_squared_error(y_test, y_pred)).round(2))
print("Mean squared error (linear model): {:.2f}".format(mean_squared_error(y_test, y_pred)))
print("r2_score (linear model): {:.2f}".format(r2_score(y_test, y_pred)))
Mean Absolute Error: 1.56 Root Mean Squared Error: 2.35 Mean squared error (linear model): 5.51 r2_score (linear model): 0.90
#save the model
import pickle
filename = 'finalized_rf.sav'
pickle.dump(regressor, open(filename, 'wb'))
Random Hyperparameter Grid
RandomizedSearchCV will be adopted to increase the performance of random forest.
To use RandomizedSearchCV, we created a parameter grid to sample from during fitting:
from sklearn.model_selection import RandomizedSearchCV
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 1000, stop = 2000, num = 20)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
'max_features': max_features,
'max_depth': max_depth,
'min_samples_split': min_samples_split,
'min_samples_leaf': min_samples_leaf,
'bootstrap': bootstrap}
print(random_grid)
{'n_estimators': [1000, 1052, 1105, 1157, 1210, 1263, 1315, 1368, 1421, 1473, 1526, 1578, 1631, 1684, 1736, 1789, 1842, 1894, 1947, 2000], 'max_features': ['auto', 'sqrt'], 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 2, 4], 'bootstrap': [True, False]}
Altogether, there were a few thousand settings (20 × 2 × 12 × 3 × 3 × 2 = 8,640 combinations). The benefit of a random search is that we do not try every combination but instead randomly sample from a wide range of values.
Random Search Training
Random search was initiated and fitted it like any Scikit-Learn model:
# Use the random grid to search for best hyperparameters
# First create the base model to tune
rf = RandomForestRegressor()
# Random search of parameters, using 5 fold cross validation,
# search across 100 different combinations, and use all available cores
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 50, cv = 5,
verbose=2, random_state=0, n_jobs = -1)
# Fit the random search model
rf_random.fit(X_train, y_train)
Fitting 5 folds for each of 50 candidates, totalling 250 fits
RandomizedSearchCV(cv=5, estimator=RandomForestRegressor(), n_iter=50,
n_jobs=-1,
param_distributions={'bootstrap': [True, False],
'max_depth': [10, 20, 30, 40, 50, 60,
70, 80, 90, 100, 110,
None],
'max_features': ['auto', 'sqrt'],
'min_samples_leaf': [1, 2, 4],
'min_samples_split': [2, 5, 10],
'n_estimators': [1000, 1052, 1105, 1157,
1210, 1263, 1315, 1368,
1421, 1473, 1526, 1578,
1631, 1684, 1736, 1789,
1842, 1894, 1947,
2000]},
random_state=0, verbose=2)
The best parameters detected by fitting the random search were:
rf_random.best_params_
{'n_estimators': 1368,
'min_samples_split': 5,
'min_samples_leaf': 1,
'max_features': 'sqrt',
'max_depth': 70,
'bootstrap': False}
Evaluate Random Search
The base model's and the fine-tuned model's performance were evaluated to help us decide which model to select.
def evaluate(model, test_features, test_labels):
predictions = model.predict(test_features)
errors = abs(predictions - test_labels)
mape = 100 * np.mean(errors / test_labels)
accuracy = 100 - mape
print('Model Performance')
print('Average Error: {:0.4f} degrees.'.format(np.mean(errors)))
print('Accuracy = {:0.2f}%.'.format(accuracy))
return accuracy
base_model = RandomForestRegressor(max_depth=10, random_state=0, n_estimators = 1600)
# caution: the base model is fitted on the test set and then evaluated on the
# same data, so its accuracy figure below is optimistically biased
base_model.fit(X_test, y_test)
base_accuracy = evaluate(base_model, X_test, y_test)
Model Performance Average Error: 0.8321 degrees. Accuracy = 98.61%.
best_random = rf_random.best_estimator_
random_accuracy = evaluate(best_random, X_test, y_test)
Model Performance Average Error: 1.5736 degrees. Accuracy = 97.31%.
print('Decrease of {:0.2f}% performance'.format( 100 * (random_accuracy - base_accuracy) / base_accuracy))
Decrease of -1.32% performance
Performance was lower for the fine-tuned model; fine tuning does not always lead to a better result. (Note, though, that the base model above was both fitted and evaluated on the test set, which inflates its accuracy and makes this comparison unfavourable to the tuned model.)
A multilayer perceptron (MLP) is a fully connected class of feedforward artificial neural network (ANN). The term MLP is used ambiguously, sometimes loosely to mean any feedforward ANN, sometimes strictly to refer to networks composed of multiple layers of perceptrons (with threshold activation).
Our multilayer perceptron has 3 hidden layers with 100, 65 and 32 neurons respectively.
Activation function: ReLU
Weight optimisation: Adam
L2 regularisation term alpha: 0.001
Learning rate: 0.001
Epochs: 200
Epsilon: 1e-8 (float, default)
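For reference, a minimal sketch making all of the listed hyperparameters explicit (these are MLPRegressor keyword arguments; the model actually fitted below sets only hidden_layer_sizes and max_iter, leaving the rest at scikit-learn defaults, and uses max_iter=2000 rather than 200):
from sklearn.neural_network import MLPRegressor
# hypothetical fully explicit configuration matching the list above
mlp = MLPRegressor(hidden_layer_sizes=(100, 65, 32), activation='relu',
                   solver='adam', alpha=0.001, learning_rate_init=0.001,
                   max_iter=200, epsilon=1e-8, random_state=1)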
df = happiness.copy()
df.drop(['Country name', 'year'], axis=1, inplace=True)
from sklearn.model_selection import train_test_split
train , test = train_test_split(df, test_size = 0.2)
X_train = train.drop('Healthy life expectancy at birth', axis=1)
y_train = train['Healthy life expectancy at birth']
X_test = test.drop('Healthy life expectancy at birth', axis = 1)
y_test = test['Healthy life expectancy at birth']
from sklearn.neural_network import MLPRegressor
regressor = MLPRegressor(random_state=1, max_iter=2000,hidden_layer_sizes=(100,65,32))
regressor.fit(X_train, y_train)
Y_pred = regressor.predict(X_test) # test the output by changing values
from sklearn.metrics import mean_squared_error, r2_score,mean_absolute_error
print("Mean squared error (linear model): {:.2f}".format(mean_squared_error(y_test, Y_pred)))
print("Mean ab error (linear model): {:.2f}".format(mean_absolute_error(y_test, Y_pred)))
print("r2_score (linear model): {:.2f}".format(r2_score(y_test, Y_pred)))
Mean squared error (linear model): 14.47 Mean ab error (linear model): 2.91 r2_score (linear model): 0.74
In this section, we used R-squared, Mean Absolute Error (MAE) and Mean Squared Error (MSE) to evaluate our 5 models.
R-squared: the proportion of the variance of the dependent variable that is explained by the independent variable(s) in a regression model.
MAE: the average absolute difference between the original and predicted values over the data set.
MSE: the average squared difference between the original and predicted values over the data set.
import pandas as pd
df = pd.read_csv("5ML Metrices.csv")
df
| ML method | mean_absolute_error | mean_squared_error | r2_score | |
|---|---|---|---|---|
| 0 | SVM | 3.33 | 20.14 | 0.63 |
| 1 | KNN Regressor | 1.56 | 6.31 | 0.89 |
| 2 | MLR | 2.53 | 12.46 | 0.76 |
| 3 | MLP | 2.91 | 14.47 | 0.74 |
| 4 | Random Forest | 1.56 | 5.51 | 0.90 |
import matplotlib.pyplot as plt
import numpy as np
ML_method = df["ML method"]
mae = df["mean_absolute_error"]
mqe = df["mean_squared_error"]
r2 = df["r2_score"]
# Reset matplotlib to its default style
plt.rcdefaults()
fig, ax = plt.subplots()
# Data to plot
x_pos = ML_method
performance = mae
ax.bar(x_pos, performance, align='center')
ax.set_xticks(x_pos)
ax.set_ylabel('MAE')
ax.set_xlabel('ML model')
ax.set_title('Overall mean absolute error performance')
plt.show()
import matplotlib.pyplot as plt
import numpy as np
ML_method = df["ML method"]
mae = df["mean_absolute_error"]
mqe = df["mean_squared_error"]
r2 = df["r2_score"]
# Reset matplotlib to its default style
plt.rcdefaults()
fig, ax = plt.subplots()
# Data to plot
x_pos = ML_method
performance = mqe
ax.bar(x_pos, performance, align='center')
ax.set_xticks(x_pos)
ax.set_ylabel('mean_squared_error')
ax.set_xlabel('ML model')
ax.set_title('Overall mean_squared_error performance')
plt.show()
import matplotlib.pyplot as plt
import numpy as np
ML_method = df["ML method"]
mae = df["mean_absolute_error"]
mqe = df["mean_squared_error"]
r2 = df["r2_score"]
# Reset matplotlib to its default style
plt.rcdefaults()
fig, ax = plt.subplots()
# Data to plot
x_pos = ML_method
performance = r2
ax.bar(x_pos, performance, align='center')
ax.set_xticks(x_pos)
ax.set_ylabel('r2_score')
ax.set_xlabel('ML model')
ax.set_title('Overall r2_score performance')
plt.show()
df_metrices = df
df_metrices["MAE_rank"]= df_metrices["mean_absolute_error"].rank(method="max")
df_metrices["MQE_rank"]= df_metrices["mean_squared_error"].rank(method="max")
df_metrices["R2_rank"]= df_metrices["r2_score"].rank(ascending=False)
df_metrices
| ML method | mean_absolute_error | mean_squared_error | r2_score | MAE_rank | MQE_rank | R2_rank | |
|---|---|---|---|---|---|---|---|
| 0 | SVM | 3.33 | 20.14 | 0.63 | 5.0 | 5.0 | 5.0 |
| 1 | KNN Regressor | 1.56 | 6.31 | 0.89 | 2.0 | 2.0 | 2.0 |
| 2 | MLR | 2.53 | 12.46 | 0.76 | 3.0 | 3.0 | 3.0 |
| 3 | MLP | 2.91 | 14.47 | 0.74 | 4.0 | 4.0 | 4.0 |
| 4 | Random Forest | 1.56 | 5.51 | 0.90 | 2.0 | 1.0 | 1.0 |
import matplotlib.pyplot as plt
import numpy as np
N=5
ML_method = df_metrices["ML method"]
mae = df_metrices["MAE_rank"]
mqe = df_metrices["MQE_rank"]
r2 = df_metrices["R2_rank"]
# Position of bars on x-axis
ind = np.arange(N)
# Figure size
plt.figure(figsize=(10,5))
# Width of a bar
width = 0.3
# Plotting
plt.bar(ind, mae , width, label='Mean Absolute Error')
plt.bar(ind + width, mqe, width, label='Mean Squared Error')
plt.bar(ind + width+ width, r2, width, label='R2 Score')
plt.xlabel('Machine Learning Model')
plt.ylabel('Rank of each metric (lower is better)')
plt.title('Ranking of the five models by evaluation metric')
# xticks()
# First argument - A list of positions at which ticks should be placed
# Second argument - A list of labels to place at the given locations
plt.xticks(ind + (2*width) / 2, ML_method)
# Finding the best position for legends and putting it
plt.legend(loc='best')
plt.show()
df_metrices["Overall_score"] = df_metrices["MAE_rank"]+df_metrices["MQE_rank"]+df_metrices["R2_rank"]
df_metrices["Overall_rank"] = df_metrices["Overall_score"].rank()
df_metrices
| ML method | mean_absolute_error | mean_squared_error | r2_score | MAE_rank | MQE_rank | R2_rank | Overall_score | Overall_rank | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | SVM | 3.33 | 20.14 | 0.63 | 5.0 | 5.0 | 5.0 | 15.0 | 5.0 |
| 1 | KNN Regressor | 1.56 | 6.31 | 0.89 | 2.0 | 2.0 | 2.0 | 6.0 | 2.0 |
| 2 | MLR | 2.53 | 12.46 | 0.76 | 3.0 | 3.0 | 3.0 | 9.0 | 3.0 |
| 3 | MLP | 2.91 | 14.47 | 0.74 | 4.0 | 4.0 | 4.0 | 12.0 | 4.0 |
| 4 | Random Forest | 1.56 | 5.51 | 0.90 | 2.0 | 1.0 | 1.0 | 4.0 | 1.0 |
We wanted a sneak peek at which attribute contributed the most influence when predicting life expectancy. Thus, we used the best-ranked model, random forest, to find out.
loaded_model = pickle.load(open("finalized_rf.sav", 'rb'))
X = happiness.iloc[:, 2:12].columns  # the 10 numeric columns after Country name and year
X = np.delete(X, 3)  # drop the target, 'Healthy life expectancy at birth'
importance = loaded_model.feature_importances_
import pandas as pd
forest_importances = pd.Series(importance, index=X)
std = np.std([tree.feature_importances_ for tree in loaded_model.estimators_], axis=0)
print(forest_importances)
fig, ax = plt.subplots()
forest_importances.plot.bar( yerr=std,ax=ax)
ax.set_title("Feature importances using MDI")
ax.set_ylabel("Mean decrease in impurity")
fig.tight_layout()
Life Ladder 0.155222 Log GDP per capita 0.281040 Social support 0.075897 Freedom to make life choices 0.030957 Generosity 0.026868 Perceptions of corruption 0.031626 Positive affect 0.025713 Negative affect 0.020632 Age ratio 0.352046 dtype: float64
Age ratio, Log GDP per capita and Life ladder have the most influence on life expectancy.
In summary, we ranked the variables by the strength of their correlation with our dependent variable, in descending order.
Log GDP per capita, Life Ladder and Age ratio influence Healthy life expectancy at birth the most, whereas Negative affect and Generosity have only weak correlations with the target variable.
Our evaluation metrics values on the 5 ML models were summarised as below:
| Models | MAE | MSE | R-Squared |
|---|---|---|---|
| Support Vector Machine | 3.33 | 20.14 | 0.63 |
| KNN Regressor | 1.56 | 6.31 | 0.89 |
| Multiple Linear Regression | 2.53 | 12.46 | 0.76 |
| Multi-Layer Perceptron | 2.91 | 14.47 | 0.74 |
| Random Forest | 1.56 | 5.51 | 0.90 |
The ranking of the ML model by types of evaluation metric were as below:
| Models | MAE | MSE | R-Squared |
|---|---|---|---|
| Support Vector Machine | 5th | 5th | 5th |
| KNN Regressor | 2nd | 2nd | 2nd |
| Multiple Linear Regression | 3rd | 3rd | 3rd |
| Multi-Layer Perceptron | 4th | 4th | 4th |
| Random Forest | 1st | 1st | 1st |
Both tables indicate that Random Forest worked best on the dataset, followed by KNN Regressor, Multiple Linear Regression (MLR), Multi-Layer Perceptron and Support Vector Machine (SVM). Since KNN Regressor and Random Forest had nearly identical R-squared and MAE values, we concluded that both performed very well.
Since our dataset was not normally distributed, non-parametric models such as KNN Regressor and Random Forest (SVM being the exception) worked better than parametric models such as MLP and MLR. A non-parametric model does not assume a functional form for the dataset, thus offering a more flexible fit (Steorts, n.d.).
In addition, KNN Regressor works better on a dataset when the number of training samples (m) is larger than the number of attributes (n), i.e. m > n (Varghese, 2018). In contrast, SVM works best when m < n, i.e. fewer training samples but more columns. In our dataset, m was much larger than n, which helps explain why SVM ranked 5th during model evaluation.
Steorts, R. (n.d.). Comparison of linear regression with K-nearest neighbors.
Varghese, D. (2018). Comparative study on classic machine learning algorithms. Towards Data Science.
Summary of experience: We felt extremely grateful to have accomplished this project on exploring happiness indicators versus life expectancy. Throughout the journey, we gained a deeper understanding of data analytics processes, learnt data cleaning methods and applied different modelling techniques to our chosen dataset. Most importantly, we met a group of friends who are also keen on delving into the world of data science, and worked together as a team.
There are some limitations and constraints in this project; for instance, we could have explored more models and evaluation metrics for the dataset. We hope to address this in future projects.